Abstract

This paper describes a computationally feasible approximation to the AIXI
agent, a universal reinforcement learning agent for arbitrary environments. AIXI is
scaled down in two key ways: First, the class of environment models is restricted
to all prediction suffix trees of a fixed maximum depth. This allows a Bayesian
mixture of environment models to be computed in time proportional to the logarithm
of the size of the model class. Secondly, the finite-horizon expectimax search is
approximated by an asymptotically convergent Monte Carlo Tree Search technique.
This scaled down AIXI agent is empirically shown to be effective on a wide class
of toy problem domains, ranging from simple fully observable games to small
POMDPs. We explore the limits of this approximate agent and propose a general
heuristic framework for scaling this technique to much larger problems.

1 Introduction

A main difficulty of doing research in artificial general intelligence has always been in 
defining exactly what artificial general intelligence means. There are many possible def- 
initions [LH07], but the AIXI formulation [Hut05] seems to capture in concrete quantita- 
tive terms many of the qualitative attributes usually associated with intelligence. 

The general reinforcement learning problem. Consider an agent that exists within 
some (unknown to the agent) environment. The agent interacts with the environment in 
cycles. At each cycle, the agent executes an action and receives in turn an observation 
and a reward. There is no explicit notion of state, neither with respect to the environment 
nor internally to the agent. The general reinforcement learning problem is to construct an 
agent that, over time, collects as much reward as possible in this setting. 
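
The interaction cycle described above can be sketched in a few lines of Python. This is only an illustration: the `CoinFlipEnv` environment, the `run` helper and the constant policy are hypothetical names introduced here and are not part of the paper. Note that the policy receives only the history, in keeping with the absence of an explicit state.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin."""

    def __init__(self, p_heads=0.7, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def step(self, action):
        # The observation is the coin outcome; reward is 1 if the guess matched.
        obs = 1 if self.rng.random() < self.p_heads else 0
        reward = 1 if action == obs else 0
        return obs, reward


def run(env, policy, cycles):
    """Run the action -> (observation, reward) cycle; return history and total reward."""
    history = []  # flat list a1, o1, r1, a2, o2, r2, ...
    total = 0
    for _ in range(cycles):
        action = policy(history)  # the agent sees only its own history
        obs, reward = env.step(action)
        history += [action, obs, reward]
        total += reward
    return history, total


# A policy that always guesses heads, run for 100 cycles.
history, total = run(CoinFlipEnv(), lambda h: 1, cycles=100)
```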

The AIXI agent. The AIXI agent is a mathematical solution to the general reinforce-
ment learning problem. The AIXI setup mirrors that of the general reinforcement learning
problem, except that the environment is assumed to be an unknown but computable function;
i.e. the observations and rewards received by the agent given its actions can be computed 
by a Turing machine. Furthermore, the AIXI agent is assumed to exist for a finite, but 
arbitrarily large amount of time. The AIXI agent results from a synthesis of two ideas: 

1. the use of a finite-horizon expectimax operation from sequential decision theory for 
action selection; and 

2. an extension of Solomonoff's universal induction scheme [Sol64] for future
prediction in the agent context.

More formally, let $U(q, a_1 a_2 \ldots a_n)$ denote the output of a universal Turing machine
$U$ supplied with program $q$ and input $a_1 a_2 \ldots a_n$, $m \in \mathbb{N}$ a finite lookahead horizon,
and $\ell(q)$ the length in bits of program $q$. The action picked by AIXI at time $t$, having
executed actions $a_1 a_2 \ldots a_{t-1}$ and received the sequence of observation-reward pairs
$o_1 r_1 o_2 r_2 \ldots o_{t-1} r_{t-1}$ from the environment, is given by:

$$a_t^* = \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_{t+m}} \sum_{o_{t+m} r_{t+m}} [r_t + \cdots + r_{t+m}] \sum_{q \,:\, U(q, a_1 \ldots a_{t+m}) = o_1 r_1 \ldots o_{t+m} r_{t+m}} 2^{-\ell(q)} \qquad (1)$$

Intuitively, the agent considers the sum of the total reward over all possible futures (up
to $m$ steps ahead), weighs each of them by the complexity of programs (consistent with
the agent's past) that can generate that future, and then picks the action that maximises
expected future rewards. Equation (1) embodies in one line the major ideas of Bayes, Ockham,
Epicurus, Turing, von Neumann, Bellman, Kolmogorov, and Solomonoff. The AIXI
agent is rigorously shown in [Hut05] to be optimal in different senses of the word.
(Technically, AIXI is Pareto optimal and 'self-optimising' in different classes of environment.)
In particular, the AIXI agent will rapidly learn an accurate model of the environment and
proceed to act optimally to achieve its goal.

The AIXI formulation also takes into account stochastic environments because
Equation (1) can be shown to be formally equivalent to the following expression:

$$a_t^* = \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_{t+m}} \sum_{o_{t+m} r_{t+m}} [r_t + \cdots + r_{t+m}] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \rho(o_1 r_1 \ldots o_{t+m} r_{t+m} \mid a_1 \ldots a_{t+m}) \qquad (2)$$

where $\rho(o_1 r_1 \ldots o_{t+m} r_{t+m} \mid a_1 \ldots a_{t+m})$ is the probability of $o_1 r_1 \ldots o_{t+m} r_{t+m}$ given actions
$a_1 \ldots a_{t+m}$. Class $\mathcal{M}$ consists of all enumerable chronological semimeasures [Hut05],
which includes all computable $\rho$, and $K(\rho)$ denotes the Kolmogorov complexity of $\rho$
[LV08].

An accessible overview of the AIXI agent can be found in [Leg08]. A complete
description of the agent is given in [Hut05].

AIXI as a principle. The AIXI formulation is best understood as a rigorous definition
of optimal decision making in general unknown environments, and not as an algorithmic
solution to the general AI problem. (AIXI, after all, is only asymptotically computable.)
As such, its role in general AI research should be viewed in the same way that, for example,
the minimax and empirical risk minimisation principles are viewed in decision theory and
statistical machine learning research. These principles define what is optimal behaviour if
computational complexity is not an issue, and can provide important theoretical guidance
in the design of practical algorithms. It is in this light that we see AIXI. This paper is an
attempt to scale AIXI down to produce a practical agent that can perform well in a wide
range of different, unknown and potentially noisy environments.

Approximating AIXI. As can be seen in Equation (1), there are two parts to AIXI. The
first is the expectimax search into the future, which we will call planning. The second
is the use of a Bayesian mixture over Turing machines to predict future observations
and rewards based on past experience; we will call that learning. Both parts need to be
approximated for computational tractability. There are many different approaches one can
try. In this paper, we opted to use a generalised version of the UCT algorithm [KS06] for
planning and a generalised version of the Context Tree Weighting algorithm [WST95] for
learning. This harmonious combination of ideas, together with the attendant theoretical
and experimental results, forms the main contribution of this paper.

Paper organisation. The paper is organised as follows. Section 2 describes the basic
agent setting and discusses some design issues. Section 3 then presents a Monte Carlo
Tree Search procedure that we will use to approximate the expectimax operation in AIXI.
This is followed by a description of the context tree weighting algorithm and how it can
be generalised for use in the agent setting in Section 4. We put the two ideas together
in Section 5 to form our agent algorithm. Theoretical and experimental results are then
presented in Sections 6 and 7. We end with a discussion of related work and other topics
in Section 8.

2 The Agent Setting and Some Design Issues

Notation. A string $x_1 x_2 \ldots x_n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$,
$j \leq n$, is denoted by $x_{\leq j}$ or $x_{<j+1}$. The notation generalises for blocks of symbols: e.g.
$ax_{1:n}$ denotes $a_1 x_1 a_2 x_2 \ldots a_n x_n$ and $ax_{<j}$ denotes $a_1 x_1 a_2 x_2 \ldots a_{j-1} x_{j-1}$. The empty string is
denoted by $\epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$.

Agent setting. The (finite) action, observation, and reward spaces are denoted by $\mathcal{A}$, $\mathcal{O}$,
and $\mathcal{R}$ respectively. Also, $\mathcal{X}$ denotes the joint perception space $\mathcal{O} \times \mathcal{R}$.

Definition 1. A history is a string $h \in (\mathcal{A} \times \mathcal{X})^n$, for some $n \geq 0$. A partial history is the
prefix of some history.

The set of all history strings of maximum length $n$ will be denoted by $(\mathcal{A} \times \mathcal{X})^{\leq n}$.

The following definition states that the agent's model of the environment takes the 
form of a probability distribution over possible observation-reward sequences conditioned 
on actions taken by the agent. 

Definition 2. An environment model $\rho$ is a sequence of functions $\{\rho_0, \rho_1, \ldots\}$,
$\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$, that satisfies:

1. $\forall a_{1:n} \forall x_{<n} : \rho_{n-1}(x_{<n} \mid a_{<n}) = \sum_{x_n \in \mathcal{X}} \rho_n(x_{1:n} \mid a_{1:n})$

2. $\forall a_{1:n} \forall x_{1:n} : \rho_n(x_{1:n} \mid a_{1:n}) > 0$.

The first condition (called the chronological condition in [Hut05]) captures the natural
constraint that action $a_n$ has no effect on observations made before it. The second
condition enforces the requirement that the probability of every possible observation-reward
sequence is non-zero. This ensures that conditional probabilities are always defined. It is
not a serious restriction in practice, as probabilities can get arbitrarily small. For
convenience, we drop the subscript on $\rho_n$ from here onwards.

Given an environment model $\rho$, we have the following identities:

$$\rho(x_n \mid ax_{<n} a_n) = \frac{\rho(x_{1:n} \mid a_{1:n})}{\rho(x_{<n} \mid a_{<n})} \qquad (3)$$

$$\rho(x_{1:n} \mid a_{1:n}) = \rho(x_1 \mid a_1)\, \rho(x_2 \mid ax_1 a_2) \cdots \rho(x_n \mid ax_{<n} a_n) \qquad (4)$$

Reward, policy and value functions. We represent the notion of reward as a numeric
value that represents the magnitude of instantaneous pleasure experienced by the agent at
any given time step. Our agent is a hedonist; its goal is to accumulate as much reward
as it can during its lifetime. More precisely, in our setting the agent is only interested in
maximising its future reward up to a fixed, finite, but arbitrarily large horizon $m \in \mathbb{N}$.

In order to act rationally, our agent seeks a policy that will allow it to maximise its
future reward. Formally, a policy is a function that maps a history to an action. If we define
$R_k(aor_{\leq t}) := r_k$ for $1 \leq k \leq t$, then we have the following definition for the expected future
value of an agent acting under a particular policy:

Definition 3. Given history $ax_{<t}$, the $m$-horizon expected future reward of an agent acting
under policy $\pi : (\mathcal{A} \times \mathcal{X})^* \to \mathcal{A}$ with respect to an environment model $\rho$ is:

$$v_\rho^m(\pi, ax_{<t}) := \mathbb{E}_{x_{t:t+m} \sim \rho} \left[ \sum_{i=t}^{t+m} R_i(ax_{\leq t+m}) \right] \qquad (5)$$

where for $t \leq k \leq t + m$, $a_k := \pi(ax_{<k})$. The quantity $v_\rho^m(\pi, ax_{<t} a_t)$ is defined similarly,
except that $a_t$ is now no longer defined by $\pi$.

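The optimal policy $\pi^*$ is the policy that maximises Equation (5). The maximal
achievable expected future reward of an agent with history $h \in (\mathcal{A} \times \mathcal{X})^{t-1}$ in environment $\rho$
looking $m$ steps ahead is $V_\rho^m(h) := v_\rho^m(\pi^*, h)$. It is easy to see that

$$V_\rho^m(h) = \max_{a_t} \sum_{x_t} \rho(x_t \mid h a_t) \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \rho(x_{t+m} \mid h\, ax_{t:t+m-1}\, a_{t+m}) \left[ \sum_{i=t}^{t+m} r_i \right] \qquad (6)$$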
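
The alternation of maximisation over actions and expectation over percepts in Equation (6) can be written as a short recursion. The sketch below is an illustration under stated assumptions rather than the paper's implementation: `rho(h, a)` is a hypothetical callable that returns a dictionary mapping each percept `(o, r)` to its probability, `steps` counts the remaining action-percept cycles, and `toy_rho` is a made-up model used only to exercise the function.

```python
def expectimax(rho, actions, h, steps):
    """Exact value of history h when `steps` action-percept cycles remain."""
    if steps == 0:
        return 0.0
    best = float("-inf")
    for a in actions:  # max over actions
        value = 0.0
        for (o, r), p in rho(h, a).items():  # expectation over percepts
            future = expectimax(rho, actions, h + ((a, o, r),), steps - 1)
            value += p * (r + future)
        best = max(best, value)
    return best


def toy_rho(h, a):
    """Made-up model: action 0 pays 1 for sure, action 1 pays 0 or 3 with equal odds."""
    if a == 0:
        return {(0, 1): 1.0}
    return {(0, 0): 0.5, (0, 3): 0.5}


value = expectimax(toy_rho, [0, 1], (), steps=2)
```

The cost of this recursion grows as $(|\mathcal{A}| \cdot |\mathcal{X}|)^{steps}$, which is why the exact computation is only feasible for very small horizons.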



All of our subsequent efforts can be viewed as attempting to define an algorithm that 
determines a policy as close to the optimal policy as possible given reasonable resource 
constraints. Our agent is model based: we learn a model of the environment and use it to 
estimate the future value of our various actions at each time step. These estimates allow 
the agent to select an approximately best action given limited computational resources.

We now discuss some high-level design issues before presenting our algorithm in the 
next section. 



Perceptual aliasing. A major problem in general reinforcement learning is perceptual 
aliasing [Chr92], which refers to the situation where the instantaneous perceptual infor- 
mation (a single observation in our setting) does not provide enough information for the 
agent to act optimally. This problem is closely related to the question of what constitutes 
a state, an issue we discuss next. 

State vs history based agents. A Markov state [SB98] provides a sufficient statistic for 
all future observations, and therefore provides sufficient information to represent optimal 
behaviour. No perceptual aliasing can occur with a Markov state. In Markov Decision 
Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) all 
underlying environmental states are Markov. 

A compact state representation is often assumed to generalise well and therefore en- 
able efficient learning and planning. A common approach in reinforcement learning (RL) 
[SB98] is to approximate the environmental state by using a small number of handcrafted 
features. However, this approach requires both that the environmental state is known, and 
that sufficient domain knowledge is available to select the features. 

In the general RL problem, neither the states nor the domain properties are known in 
advance. One approach to general RL is to find a compact representation of state that 
is approximately Markov [McC96, Sha07, SJR04, ST04], or a compact representation of
state that maximises some performance criterion [Hut09b, Hut09a]. In practice, a Markov
representation is rarely achieved in complex domains, and these methods must introduce
some approximation, and therefore some level of perceptual aliasing.

In contrast, we focus on learning and planning methods that use the agent's history 
as its representation of state. A history representation can be generally applied without 
any domain knowledge. Importantly, a history representation requires no approximation 
and introduces no aliasing: each history is a perfect Markov state (or $k$-Markov for length
k histories). In return for these advantages, we give up on compactness. The number of 
states in a history representation is exponential in the horizon length (or k for length k 
histories), and many of these histories may be equivalent. Nevertheless, a history rep- 
resentation can sometimes be more compact than the environmental state, as it ignores 
extraneous factors that do not affect the agent's direct observations. 

Predictive environment models. In order to form non-trivial plans that span multiple 
time steps, our agent needs to be able to predict the effects of its interaction with the 
environment. If a model of the environment is known, search-based methods offer one 
way of generating such plans. However, a general RL agent does not start with a model of 
the environment; it must learn one over time. Our agent builds an approximate model of 
the true environment from the experience it gathers when interacting with the real world, 
and uses it for online planning. 

Approximation via online planning. If the problem is small, model-based RL methods 
such as Value Iteration for MDPs can easily derive an optimal policy. However this is not 
appropriate for the larger problems more typical of the real world. Local search is one 
way to address this problem. Instead of solving the problem in its entirety, an approximate 
solution is computed before each decision is made. This approach has met with much 
success on difficult decision problems within the game playing research community and 
on large-sized POMDPs [RPPCD08].

Scalability. The general RL problem is extremely difficult. On any real world prob- 
lem, an agent is necessarily restricted to making approximately correct decisions. One of 
the distinguishing features of sophisticated heuristic decision making frameworks, such as 
those used in computer chess or computer go, is the ability of these frameworks to provide 
acceptable performance on hardware ranging from mobile phones through to supercom- 
puters. To take advantage of the fast-paced advances in computer technology, we claim 
that a good autonomous agent framework should naturally and automatically scale with
increasing computational resources. Both the learning and planning components of our 
approximate AIXI agent have been designed with scalability in mind. 

Anytime decision making. One of the key resources in real world decision making is 
time. As we are interested in a practical general agent framework, it is imperative that 
our agent be able to make good approximate decisions on demand. Different application 
domains have different real-world time constraints. We seek an agent framework that
can make good, approximate decisions given anything from 10 milliseconds to 10 days
of thinking time per action.

3 Monte Carlo Tree Search with Model Updates 

In this section we describe Predictive UCT, a Monte Carlo Tree Search (MCTS) technique 
for stochastic, partially observable domains that uses an incrementally updated
environment model $\rho$ to predict and evaluate the possible outcomes of future action sequences.

The Predictive UCT algorithm is a straightforward generalisation of the UCT algo- 
rithm [KS06], a Monte Carlo planning algorithm that has proven effective in solving large 
state space discounted, or finite horizon MDPs. The generalisation requires two parts: 

• The use of an environment model that is conditioned on the agent's history, rather 
than a Markov state. 

• The updating of the environment model during search. This is essential for the algo- 
rithm to utilise the extra information an agent will have at a hypothetical, particular 
future time point. 

The generalisation involves a change in perspective which has significant practical 
ramifications in the context of general RL agents. Our extensions to UCT allow Predictive
UCT, in combination with a sufficiently powerful predictive environment model $\rho$,
to implicitly take into account the value of information in search and be applicable to 
partially observable domains. 

Overview. Predictive UCT is a best-first Monte Carlo Tree Search technique that itera- 
tively constructs a search tree in memory. The tree is composed of two interleaved types 
of nodes: decision nodes and chance nodes. These correspond to the alternating max and 
$\Sigma$ (sum) operations in expectimax. Each node in the tree corresponds to a (partial) history $h$. If
h ends with an action, it is a chance node; if h ends with an observation, it is a decision 
node. Each node contains a statistical estimate of the future reward. 

Initially, the tree starts with a single decision node containing $|\mathcal{A}|$ children. Much like
in existing MCTS methods [CWU+08], there are four conceptual phases to a single
iteration of Predictive UCT. The first is the selection phase, where the search tree is traversed
from the root node to an existing leaf chance node n. The second is the expansion phase, 
where a new decision node is added as a child to n. The third is the simulation phase, 
where a playout policy in conjunction with the environment model $\rho$ is used to sample
a possible future path from n until a fixed distance from the root is reached. Finally, the 
backpropagation phase updates the value estimates for each node on the reverse trajec- 
tory leading back to the root. Whilst time remains, these four conceptual operations are 
repeated. Once the time limit is reached, an approximate best action can be selected by 
looking at the value estimates of the children of the root node. 

During the selection phase, action selection at decision nodes is done using a policy
that balances exploration and exploitation. This policy has two main effects:

• to move the estimates of the future reward towards the maximum attainable future
reward if the agent acted optimally.

• to cause asymmetric growth of the search tree towards areas that have high predicted
reward, implicitly pruning large parts of the search space.

Figure 1: A Predictive UCT search tree

The future reward at leaf nodes is estimated by choosing actions according to a heuris- 
tic policy until a total of m actions have been made by the agent, where m is the search 
horizon. This heuristic estimate helps the agent to focus its exploration on useful parts 
of the search tree, and in practice allows for a much larger horizon than a brute-force 
expectimax search. 

Predictive UCT builds a sparse search tree in the sense that observations are only 
added to chance nodes once they have been generated along some sample path. A full 
expectimax search tree would not be sparse; each possible stochastic outcome will be 
represented by a distinct node in the search tree. For expectimax, the branching factor 
at chance nodes is thus $|\mathcal{O}|$, which means that searching to even moderate sized $m$ is
intractable. 

Figure 1 shows an example Predictive UCT tree. Chance nodes are denoted with stars. 
Decision nodes are denoted by circles. The dashed lines from a star node indicate that not 
all of the children have been expanded. The squiggly line at the base of the leftmost 
leaf denotes the execution of a playout policy. The arrows proceeding up from this node 
indicate the flow of information back up the tree; this is defined in more detail in Section 
3. 

Action selection at decision nodes. A decision node will always contain $|\mathcal{A}|$ distinct
children, all of which are chance nodes. Associated with each decision node representing
a particular history $h$ will be a value function estimate, $\hat{V}(h)$. During the selection phase,
a child will need to be picked for further exploration. Action selection in MCTS poses a 
classic exploration/exploitation dilemma. On one hand we need to allocate enough visits 
to all children to ensure that we have accurate estimates for them, but on the other hand 
we need to allocate enough visits to the maximal action to ensure convergence of the node 
to the value of the maximal child node. 
Like UCT, Predictive UCT recursively uses the UCB policy [Aue02] from the $n$-armed
bandit setting at each decision node to determine which action needs further exploration. 
Although the uniform logarithmic regret bound no longer carries across from the bandit 
setting, the UCB policy has been shown to work well in practice in complex domains 
such as Computer Go [GW06] and General Game Playing [FB08]. This policy has the 
advantage of ensuring that at each decision node, every action eventually gets explored 
an infinite number of times, with the best action being selected exponentially more often 
than actions of lesser utility. 

Definition 4. The visit count T{h) of a decision node h is the number of times h has been 
sampled by the Predictive UCT algorithm. The visit count of the chance node found by 
taking action a at h is defined similarly, and is denoted by T(ha). 

Definition 5. Suppose $m$ is the search horizon and each single time-step reward is
bounded in the interval $[\alpha, \beta]$. Given a node representing a history $h$ in the search tree,
the action picked by the UCB action selection policy is:

$$a_{UCB}(h) := \arg\max_{a \in \mathcal{A}} \begin{cases} \frac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\frac{\log(T(h))}{T(ha)}} & \text{if } T(ha) > 0; \\ \infty & \text{otherwise,} \end{cases} \qquad (7)$$

where $C \in \mathbb{R}$ is a positive parameter that controls the ratio of exploration to exploitation.
If there are multiple maximal actions, one is chosen uniformly at random.

Note that we need a linear scaling of $\hat{V}(ha)$ in Definition 5 because the UCB policy is
only applicable for rewards confined to the [0, 1] interval.
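
As a rough illustration of Definition 5, the following Python sketch scores each action with the scaled value estimate plus the exploration bonus and breaks ties uniformly at random. The function name `ucb_action` and the dictionary-based bookkeeping (`value`, `visits`) are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def ucb_action(actions, value, visits, parent_visits, m, alpha, beta, C, rng=random):
    """Pick an action following the UCB rule of Definition 5.

    value[a]      : current estimate of the m-step return after taking a
    visits[a]     : visit count T(ha) of the corresponding chance node
    parent_visits : visit count T(h) of the decision node
    """

    def score(a):
        if visits[a] == 0:
            return math.inf  # unvisited actions are always tried first
        scaled = value[a] / (m * (beta - alpha))  # linear scaling of the estimate
        return scaled + C * math.sqrt(math.log(parent_visits) / visits[a])

    best = max(score(a) for a in actions)
    # Ties between maximal actions are broken uniformly at random.
    return rng.choice([a for a in actions if score(a) == best])
```

With equal value estimates, the action with fewer visits receives the larger bonus and is chosen, which is the mechanism that keeps every action explored.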



Chance nodes. Chance nodes follow immediately after an action is selected from a
decision node. Each chance node $ha$ following a decision node $h$ contains an estimate of
the future utility denoted by $\hat{V}(ha)$. Also associated with the chance node $ha$ is a density
$\rho(\cdot \mid ha)$ over observation-reward pairs.

After an action $a$ is performed at node $h$, $\rho(\cdot \mid ha)$ is sampled once to generate the next
observation-reward pair $or$. If $o$ has not been seen before, the node $hao$ is added as a child
of $ha$. We will use the notation $\mathcal{O}_{ha}$ to denote the subset of $\mathcal{O}$ representing the children of
partial history $ha$ created so far.



Estimating future reward at leaf nodes. If a leaf decision node is encountered at depth
$k < m$ in the tree, a means of estimating the future reward for the remaining $m - k$ time
steps is required. The agent applies its heuristic playout function $\Pi$ to estimate the sum
of future rewards $\sum_{i=k}^{m} r_i$. A particularly simple, pessimistic baseline playout function is
$\Pi_{random}$, which chooses an action uniformly at random at each time step.

A more sophisticated playout function that uses action probabilities estimated from
previously taken real-world actions could potentially provide a better estimate. The
quality of the actions suggested by such a predictor can be expected to improve over time,
since it is trying to predict actions that are chosen by the agent after a Predictive UCT
search. This powerful and intuitive method of constructing a generic heuristic will be
explored further in a subsequent section.

Asymptotically, the heuristic playout policy makes no contribution to the value func- 
tion estimates of Predictive UCT. When the remaining depth is zero, the playout policy 
always returns zero reward. As the number of simulations tends to infinity, the struc- 
ture of the Predictive UCT search tree is equivalent to the exact depth m expectimax tree 
with high probability. This implies that the asymptotic value function estimates of Pre- 
dictive UCT are invariant to the choice of playout function. However, when search time 
is limited, the choice of playout policy will be a major determining factor of the overall 
performance of the agent. 

Reward backup. After the selection phase is completed, a path of nodes $n_1 n_2 \ldots n_k$,
$k \leq m$, will have been traversed from the root of the search tree $n_1$ to some leaf $n_k$. For
each $1 \leq j \leq k$, the statistics maintained for (partial) history $h_{n_j}$ associated with node $n_j$
will be updated as follows:

$$\hat{V}(h_{n_j}) \leftarrow \frac{T(h_{n_j})}{T(h_{n_j}) + 1} \hat{V}(h_{n_j}) + \frac{1}{T(h_{n_j}) + 1} \sum_{i=j}^{m} r_i \qquad (8)$$

$$T(h_{n_j}) \leftarrow T(h_{n_j}) + 1 \qquad (9)$$

Note that the same backup equations are applied to both decision and chance nodes. 
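
Equations (8) and (9) amount to maintaining a running mean of the sampled returns seen at a node. A minimal sketch, assuming each node is stored as a dictionary with hypothetical keys `"V"` (value estimate) and `"T"` (visit count):

```python
def backup(node, sampled_return):
    """Fold one sampled return into the node's running mean (Equations (8)-(9))."""
    t = node["T"]
    node["V"] = (t * node["V"] + sampled_return) / (t + 1)  # Equation (8)
    node["T"] = t + 1  # Equation (9)


node = {"V": 0.0, "T": 0}
for sampled_return in [4.0, 2.0, 6.0]:
    backup(node, sampled_return)
# node["V"] is now the mean of the three returns and node["T"] is 3.
```

Because only the previous mean and the visit count are needed, the update takes constant time and no list of past returns has to be kept.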

Incremental model updating. Recall from Definition 2 that an environment model $\rho$
is a sequence of functions $\{\rho_0, \rho_1, \rho_2, \ldots\}$, where $\rho_t : \mathcal{A}^t \to \text{Density}(\mathcal{X}^t)$. When invoking
the Sample routine to decide on an action, many hypothetical future experiences will be
generated, with $\rho_t$ being used to simulate the environment at time $t$. For the algorithm to
work well in practice, we need to be able to perform the following two operations in time
sublinear with respect to the length of the agent's entire experience string.

• Update: given $\rho_t(x_{1:t} \mid a_{1:t})$, $a_{t+1}$, and $x_{t+1}$, produce $\rho_{t+1}(x_{1:t+1} \mid a_{1:t+1})$

• Revert: given $\rho_{t+1}(x_{1:t+1} \mid a_{1:t+1})$, recover $\rho_t(x_{1:t} \mid a_{1:t})$

The revert operation is needed to restore the environment model to $\rho_t$ after each
simulation to time $t + m$ is performed. In Section 4, we will show how these requirements can
be met efficiently by a certain kind of Bayesian mixture over a rich model class.
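
To make the Update and Revert contract concrete, here is a deliberately simple stand-in model. It is not the mixture model of Section 4: `CountingModel` is a hypothetical Laplace-smoothed frequency model, chosen only because both operations are easy to see. Each update is logged on a stack, so a revert just pops the log and undoes the count change, independently of how long the agent's history is.

```python
class CountingModel:
    """Hypothetical stand-in for the environment model: smoothed percept counts per action."""

    def __init__(self, actions, percepts):
        self.percepts = list(percepts)
        self.counts = {a: {x: 0 for x in self.percepts} for a in actions}
        self.log = []  # stack of (action, percept) pairs, newest last

    def prob(self, action, percept):
        c = self.counts[action]
        return (c[percept] + 1) / (sum(c.values()) + len(self.percepts))

    def update(self, action, percept):
        self.counts[action][percept] += 1
        self.log.append((action, percept))

    def revert(self, steps=1):
        for _ in range(steps):
            action, percept = self.log.pop()
            self.counts[action][percept] -= 1


model = CountingModel(actions=[0, 1], percepts=["x", "y"])
model.update(0, "x")  # one real experience
before = model.prob(0, "x")
for percept in ["y", "y", "x"]:  # three simulated steps during search
    model.update(0, percept)
model.revert(3)  # undo the simulation
after = model.prob(0, "x")
```

After the revert, `after` equals `before`, so simulated experience leaves no trace in the model used for the next simulation.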

Pseudocode. We now give the pseudocode of the entire Predictive UCT algorithm. 

Algorithm 1 is responsible for determining an approximate best action. Given the
current history $h$, it first constructs a search tree containing estimates $\hat{V}_\rho^m(ha)$ for each
$a \in \mathcal{A}$, and then selects a maximising action. An important property of Algorithm 1 is
that it is anytime; an approximate best action is always available, whose quality improves
with extra computation time.
Algorithm 1 Predictive UCT($\rho$, $h$, $m$)
Require: An environment model $\rho$
Require: A history $h$
Require: A search horizon $m \in \mathbb{N}$

Initialise($\Psi$)
repeat
    Sample($\rho$, $\Psi$, $h$, $m$)
    $\rho \leftarrow$ Revert($\rho$, $m$)
until out of time
return BestAction($\Psi$, $h$)



For simplicity of exposition, Initialise can be understood to simply clear the entire
search tree $\Psi$. In practice, it is possible to carry across information from one time step to
another. If $\Psi_t$ is the search tree obtained at the end of time $t$, and $aor$ is the agent's actual
action and experience at time $t$, then we can keep the subtree rooted at node $\Psi_t(hao)$ in
$\Psi_t$ and make that the search tree for use at the beginning of the next time step. The
remainder of the nodes in $\Psi_t$ can then be deleted.

As a Monte Carlo Tree Search routine, Algorithm 1 is embarrassingly parallel. The
main idea is to concurrently invoke the Sample routine whilst providing appropriate lock- 
ing mechanisms for the nodes in the search tree. An efficient parallel implementation is 
beyond the scope of the paper, but it is worth noting that ideas [CWH08] applicable to 
high performance Monte Carlo Go programs are easily transferred to our setting. 

Algorithm 2 implements a single run through some trajectory in the search tree. It 
uses the SelectAction routine to choose moves at interior nodes, and invokes the playout 
policy at unexplored leaf nodes. Once a path of length $m$ has been completed, the
recursion takes care that every visited node along the path to the leaf is updated as per 
Section 3. 

The action chosen by SelectAction is specified by the UCB policy described in Def- 
inition 5. If the selected child has not been explored before, then a new node is added 
to the search tree. The constant C is a parameter that is used to control the shape of the 
search tree; lower values of C create deep, selective search trees, whilst higher values lead 
to shorter, bushier trees. 

4 Extensions of Context Tree Weighting 

Context Tree Weighting (CTW) [WST95, WST97] is a theoretically well-motivated on- 
line binary sequence prediction algorithm that works well in practice [BEYY04]. It is 
an online Bayesian model averaging algorithm that computes a mixture of all prediction 
suffix trees [RST96] of a given bounded depth, with higher prior weight given to simpler 
models. We examine in this section several extensions of CTW needed for its use in the 
context of agents. Along the way, we will describe the CTW algorithm in detail. 
Algorithm 2 Sample($\rho$, $\Psi$, $h$, $m$)
Require: An environment model $\rho$
Require: A search tree $\Psi$
Require: A (partial) history $h$
Require: A remaining search horizon $m \in \mathbb{N}$

if $m = 0$ then
    return 0
else if $\Psi(h)$ is a chance node then
    Generate $(o, r)$ from $\rho(or \mid h)$
    Create node $\Psi(hor)$ if $T(hor) = 0$
    reward $\leftarrow r$ + Sample($\rho$, $\Psi$, $hor$, $m - 1$)
else if $T(h) = 0$ then
    reward $\leftarrow$ Playout($\rho$, $h$, $m$)
else
    $a \leftarrow$ SelectAction($\Psi$, $h$)
    reward $\leftarrow$ Sample($\rho$, $\Psi$, $ha$, $m$)
end if
$\hat{V}(h) \leftarrow \frac{1}{T(h) + 1} [\text{reward} + T(h) \hat{V}(h)]$
$T(h) \leftarrow T(h) + 1$
return reward



Algorithm 3 SelectAction($\Psi$, $h$)
Require: A search tree $\Psi$
Require: A history $h$
Require: An exploration/exploitation constant $C$

$\mathcal{U} = \{a \in \mathcal{A} : T(ha) = 0\}$
if $\mathcal{U} \neq \{\}$ then
    Pick $a \in \mathcal{U}$ uniformly at random
    Create node $\Psi(ha)$
    return $a$
else
    return $\arg\max_{a \in \mathcal{A}} \left\{ \frac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\frac{\log(T(h))}{T(ha)}} \right\}$
end if



Algorithm 4 Playout($\rho$, $h$, $m$)
Require: An environment model $\rho$
Require: A history $h$
Require: A remaining search horizon $m \in \mathbb{N}$
Require: A playout function $\Pi$

reward $\leftarrow 0$
for $i = 1$ to $m$ do
    Generate $a$ from $\Pi(h)$
    Generate $(o, r)$ from $\rho(or \mid ha)$
    reward $\leftarrow$ reward $+ r$
    $h \leftarrow haor$
end for
return reward

Action-conditional CTW. We first look at how CTW can be generalised for use as
environment models (Definition 2), which are functions of the form $\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$.
This means we need an extension of CTW that, incrementally, takes as input a sequence
of actions and produces as output successive conditional probabilities over observations
and rewards. The high-level view of the algorithm is as follows: we process observations
and rewards one bit at a time using standard CTW, but bits representing actions are simply
appended to the input sequence without updating the context tree. The algorithm is now
described in detail. If we drop the action sequence throughout the following description,
the algorithm reduces to the standard CTW algorithm.

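Krichevsky-Trofimov estimator. We start with a brief review of the KT estimator
[KT81] for Bernoulli distributions. Given a binary string $y_{1:t}$ with $a$ zeroes and $b$ ones, the
KT estimate of the probability of the next symbol is as follows:

$$\Pr\nolimits_{kt}(y_{t+1} = 1 \mid y_{1:t}) := \frac{b + 1/2}{a + b + 1} \tag{10}$$

$$\Pr\nolimits_{kt}(y_{t+1} = 0 \mid y_{1:t}) := 1 - \Pr\nolimits_{kt}(y_{t+1} = 1 \mid y_{1:t}). \tag{11}$$

The KT estimator is obtained via a Bayesian analysis by putting a $(\frac{1}{2}, \frac{1}{2})$-Beta prior on the
parameter of the Bernoulli distribution. From (10)-(11), we obtain the following expression
for the block probability of a string:

$$\Pr\nolimits_{kt}(y_{1:t}) = \Pr\nolimits_{kt}(y_1 \mid \epsilon) \Pr\nolimits_{kt}(y_2 \mid y_1) \cdots \Pr\nolimits_{kt}(y_t \mid y_{<t}).$$

Given a binary string $s$, one can establish that $\Pr_{kt}(s)$ depends only on the number of
zeroes $a_s$ and ones $b_s$ in $s$. If we let $0^a 1^b$ denote a string with $a$ zeroes and $b$ ones then:

$$\Pr\nolimits_{kt}(s) = \Pr\nolimits_{kt}(0^{a_s} 1^{b_s}) = \frac{\frac{1}{2}\left(1 + \frac{1}{2}\right) \cdots \left(a_s - \frac{1}{2}\right) \cdot \frac{1}{2}\left(1 + \frac{1}{2}\right) \cdots \left(b_s - \frac{1}{2}\right)}{(a_s + b_s)!}. \tag{12}$$

We write $\Pr_{kt}(a, b)$ to denote $\Pr_{kt}(0^a 1^b)$ in the following. The quantity $\Pr_{kt}(a, b)$ can be
updated incrementally as follows:

$$\Pr\nolimits_{kt}(a + 1, b) = \frac{a + 1/2}{a + b + 1} \Pr\nolimits_{kt}(a, b) \tag{13}$$

$$\Pr\nolimits_{kt}(a, b + 1) = \frac{b + 1/2}{a + b + 1} \Pr\nolimits_{kt}(a, b), \tag{14}$$

with the base case being $\Pr_{kt}(0, 0) = 1$.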
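The incremental updates (13)-(14) translate almost directly into code. The following is a minimal sketch using exact rational arithmetic; the function names are chosen here for illustration only.

```python
from fractions import Fraction


def kt_prob_one(a, b):
    """KT estimate that the next bit is 1 after seeing a zeroes and b ones (Eq. 10)."""
    return Fraction(2 * b + 1, 2 * (a + b + 1))


def kt_block(bits):
    """KT block probability of a bit sequence, built up with Eqs. (13)-(14)."""
    a = b = 0
    p = Fraction(1)  # base case: Pr_kt(0, 0) = 1
    for y in bits:
        p1 = kt_prob_one(a, b)
        if y == 1:
            p *= p1
            b += 1
        else:
            p *= 1 - p1
            a += 1
    return p
```

For instance, `kt_block([0, 1])` and `kt_block([1, 0])` both evaluate to 1/8, reflecting the fact that the block probability depends only on the counts of zeroes and ones.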
Figure 2: An example prediction suffix tree



Prediction Suffix Trees. We next describe prediction suffix trees, which are a form of
variable-order Markov models.

Definition 6. A prediction suffix tree (PST) is a pair $(M, \Theta)$ satisfying the following:

1. $M$ is a binary tree where the left and right edges are labelled 1 and 0 respectively;
and

2. associated with each leaf node $l$ in $M$ is a probability distribution over $\{0, 1\}$
parameterised by $\theta_l \in \Theta$ (the probability of 1).

We call $M$ the model of the PST and $\Theta$ the parameter of the PST, in accordance with the
terminology of [WST95].

A prediction suffix tree $(M, \Theta)$ maps each binary string $y_{1:n}$, where $n \geq$ the depth of
$M$, to a probability distribution over $\{0, 1\}$ in the natural way: we traverse the model $M$
by moving left or right at depth $d$ depending on whether the bit $y_{n-d}$ is one or zero until
we reach a leaf node $l$ in $M$, at which time we return $\theta_l$. For example, the PST shown
in Figure 2 maps the string 110 to $\theta_{10} = 0.3$. At the root node (depth 0), we move right
because $y_3 = 0$. We then move left because $y_{3-1} = 1$. We say $\theta_{10}$ is the distribution
associated with the string 110. Sometimes we need to refer to the leaf node holding the
distribution associated with a string $h$; we denote that by $M(h)$, where $M$ is the model of
the PST used to process the string.

To use a prediction suffix tree of depth $d$ for binary sequence prediction, we start with
the distribution $\theta_l := \Pr_{kt}(1 \mid \epsilon) = 1/2$ at each leaf node $l$ of the tree. The first $d$ bits $y_{1:d}$
of the input sequence are set aside for use as an initial context and the variable $h$ denoting
the bit sequence seen so far is set to $y_{1:d}$. We then repeat the following steps as long as
needed:

1. predict the next bit using the distribution $\theta_h$ associated with $h$;

2. observe the next bit $y$, update $\theta_h$ using Formula (10) by incrementing either $a$ or $b$
according to the value of $y$, and then set $h := hy$.
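The predict/update loop above can be sketched as follows. The classes `Leaf` and `Internal`, the helper names, and the three-leaf tree used in the usage note are illustrative assumptions, not part of the original description.

```python
class Leaf:
    """Leaf of a PST model, holding the KT counts for its context."""

    def __init__(self):
        self.a = 0  # zeroes seen in this context
        self.b = 0  # ones seen in this context

    def prob_one(self):
        # KT estimate of the next bit being 1 (Formula (10)).
        return (self.b + 0.5) / (self.a + self.b + 1)


class Internal:
    """Internal node; `one` is the child on the edge labelled 1, `zero` on the edge labelled 0."""

    def __init__(self, one, zero):
        self.one = one
        self.zero = zero


def find_leaf(model, h):
    """Walk down the model, reading the history h from its most recent bit backwards."""
    node, d = model, 0
    while isinstance(node, Internal):
        bit = h[len(h) - 1 - d]
        node = node.one if bit == 1 else node.zero
        d += 1
    return node


def run_pst(model, depth, bits):
    """Return the predicted probability of a 1 before each bit after the initial context."""
    h = list(bits[:depth])  # the first `depth` bits serve as initial context
    predictions = []
    for y in bits[depth:]:
        leaf = find_leaf(model, h)
        predictions.append(leaf.prob_one())  # step 1: predict
        if y == 1:                           # step 2: update the KT counts
            leaf.b += 1
        else:
            leaf.a += 1
        h.append(y)
    return predictions
```

With `model = Internal(Leaf(), Internal(Leaf(), Leaf()))` and `depth = 2`, each call to `run_pst` touches exactly one leaf per observed bit, so the cost per bit is bounded by the depth of the model.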

Action-conditional PST. The above describes how a PST is used for binary sequence
prediction. In the agent setting, we reduce the problem of predicting history sequences
with general non-binary alphabets to that of predicting the bit representations of those
sequences. Further, we only ever condition on actions and this is achieved by appending
bit representations of actions to the input sequence without a corresponding update of the
KT estimators. These ideas are now formalised.

For convenience, we will assume without loss of generality that $|\mathcal{A}| = 2^{l_\mathcal{A}}$ and
$|\mathcal{X}| = 2^{l_\mathcal{X}}$ for some $l_\mathcal{A}, l_\mathcal{X} > 0$. Given $a \in \mathcal{A}$, we denote by
$\llbracket a \rrbracket = a[1, l_\mathcal{A}] = a[1]a[2] \ldots a[l_\mathcal{A}] \in \{0, 1\}^{l_\mathcal{A}}$ the bit representation of $a$.
Observation and reward symbols are treated similarly. Further, the bit representation of a
symbol sequence $x_{1:t}$ is denoted by $\llbracket x_{1:t} \rrbracket = \llbracket x_1 \rrbracket \llbracket x_2 \rrbracket \ldots \llbracket x_t \rrbracket$. The $i$th bit in
$\llbracket x_{1:t} \rrbracket$ is denoted by $\llbracket x_{1:t} \rrbracket[i]$ and the first $l$ bits of $\llbracket x_{1:t} \rrbracket$ are denoted by $\llbracket x_{1:t} \rrbracket[1, l]$.

To do action-conditional prediction using a PST, we again start with $\theta_l := \Pr_{kt}(1 \mid \epsilon) = 1/2$
at each leaf node $l$ of the tree. We also set aside a sufficiently long initial portion of
the binary history sequence corresponding to the first few cycles to initialise the variable
$h$ as usual. The following steps are then repeated as long as needed:

1. set $h := h\llbracket a \rrbracket$, where $a$ is the current selected action;

2. for $i := 1$ to $l_\mathcal{X}$ do

(a) predict the next bit using the distribution $\theta_h$ associated with $h$;

(b) observe the next bit $x[i]$, update $\theta_h$ using Formula (10) according to the value
of $x[i]$, and then set $h := hx[i]$.
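The two steps above can be sketched as follows. To keep the example short, the PST model is assumed to be the complete tree of depth D (one KT estimator per D-bit context); this simplification, the most-significant-bit-first encoding, and all names are assumptions of the sketch rather than details taken from the text.

```python
def encode(symbol, nbits):
    """Bit representation of an integer symbol, most significant bit first (assumed ordering)."""
    return [(symbol >> (nbits - 1 - i)) & 1 for i in range(nbits)]


class FullDepthPST:
    """PST whose model is the complete binary tree of depth D."""

    def __init__(self, depth):
        self.depth = depth
        self.counts = {}  # context (tuple of the last D bits) -> [zeroes, ones]

    def prob_one(self, h):
        a, b = self.counts.get(tuple(h[-self.depth:]), (0, 0))
        return (b + 0.5) / (a + b + 1)

    def update(self, h, bit):
        self.counts.setdefault(tuple(h[-self.depth:]), [0, 0])[bit] += 1


def process_cycle(pst, h, action, percept, l_a, l_x):
    """Process one cycle; returns the probability the PST assigned to `percept`."""
    h.extend(encode(action, l_a))  # step 1: action bits are appended, no KT update
    p = 1.0
    for bit in encode(percept, l_x):  # step 2: percept bits are predicted, then learned
        p1 = pst.prob_one(h)
        p *= p1 if bit == 1 else 1 - p1
        pst.update(h, bit)
        h.append(bit)
    return p
```

Because the action bits enter the history but never touch the counts, the estimators only ever model the percept bits, conditioned on whatever actions preceded them.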

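Now, let $M$ be the model of a prediction suffix tree, $L(M)$ the leaf nodes of $M$, $a_{1:t} \in \mathcal{A}^t$
an action sequence, and $x_{1:t} \in \mathcal{X}^t$ an observation-reward sequence. We have the following
expression for the probability of $x_{1:t}$ given $M$ and $a_{1:t}$:

\begin{align}
\Pr(x_{1:t} \mid M, a_{1:t}) &= \prod_{i=1}^{t} \prod_{j=1}^{l_\mathcal{X}} \Pr(x_i[j] \mid M, \llbracket ax_{<i}a_i \rrbracket x_i[1, j-1]) \nonumber \\
&= \prod_{n \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n), \tag{15}
\end{align}

where $\llbracket x_{1:t} \rrbracket|_n$ is the (non-contiguous) subsequence of $\llbracket x_{1:t} \rrbracket$ that ended up in leaf node $n$
in $M$. More precisely,

$$\llbracket x_{1:t} \rrbracket|_n := \llbracket x_{1:t} \rrbracket[l_1] \, \llbracket x_{1:t} \rrbracket[l_2] \cdots \llbracket x_{1:t} \rrbracket[l_n],$$

where $1 \leq l_1 < l_2 < \cdots < l_n \leq t$ and, for each $i$, $i \in \{l_1, \ldots, l_n\}$ iff $M(\llbracket x_{1:t} \rrbracket[1, i-1]) = n$.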

The above deals with action-conditional prediction using a single PST. We now show 
how we can perform action-conditional prediction using a Bayesian mixture of PSTs in 
an efficient way. First, we need a prior distribution on models of PSTs. 

A prior on models of PSTs. Our prior, containing an Ockham-like bias favouring simple
models, is derived from a natural prefix coding of the tree structure of a PST. The
coding scheme works as follows: given a model of a PST of maximum depth $D$, a pre-order
traversal of the tree is performed. Each time an internal node is encountered, we
write down 1. Each time a leaf node is encountered, we write a 0 if the depth of the leaf
node is less than $D$; otherwise we write nothing. For example, if $D = 3$, the code for the
model shown in Figure 2 is 10100; if $D = 2$, the code for the same model is 101. The cost
$\Gamma_D(M)$ of a model $M$ is the length of its code, which is given by the number of nodes in
$M$ minus the number of leaf nodes in $M$ of depth $D$. One can show that

$$\sum_{M \in C_D} 2^{-\Gamma_D(M)} = 1,$$

where $C_D$ is the set of all models of prediction suffix trees with depth at most $D$; i.e. the
prefix code is complete. We remark that the above is another way of describing the coding
scheme in [WST95]. We use $2^{-\Gamma_D(\cdot)}$, which penalises large trees, to determine the prior
weight of each PST model.
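The cost function and the completeness of the code are easy to check numerically for small depths. In the sketch below, a model is represented (purely as an illustrative convention) by `None` for a leaf and by a pair `(one_child, zero_child)` for an internal node.

```python
from fractions import Fraction


def cost(model, D, depth=0):
    """Code length of a PST model under the pre-order coding scheme."""
    if model is None:
        # A leaf costs one bit unless it already sits at the maximum depth D.
        return 1 if depth < D else 0
    one_child, zero_child = model
    return 1 + cost(one_child, D, depth + 1) + cost(zero_child, D, depth + 1)


def all_models(D):
    """Enumerate every PST model of depth at most D."""
    if D == 0:
        return [None]
    subtrees = all_models(D - 1)
    return [None] + [(l, r) for l in subtrees for r in subtrees]


def prior_mass(D):
    """Total prior weight of all models of depth at most D; equals 1 if the code is complete."""
    return sum(Fraction(1, 2 ** cost(M, D)) for M in all_models(D))
```

A three-leaf tree such as `(None, (None, None))` has cost 5 when `D = 3` and cost 3 when `D = 2`, matching the code lengths of 10100 and 101 quoted above.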

Context trees. The following is a key ingredient of the (action-conditional) CTW algo- 
rithm. 

Definition 7. A context tree of depth $D$ is a perfect binary tree of depth $D$ where the left
and right edges are labelled 1 and 0 respectively and attached to each node (both internal
and leaf) is a probability on $\{0, 1\}^*$.

The node probabilities in a context tree are estimated from data using KT estimators 
as follows. We update a context tree with the history sequence similarly to the way we 
use a PST, except that 

1. the probabilities at each node in the path from the root to a leaf traversed by an
observed bit are updated; and

2. we maintain block probabilities using Equations (12)-(14) instead of conditional
probabilities (Equation (10)) as in a PST. (This is done for computational reasons
to ease the calculation of the posterior probabilities of models in the algorithm.)

The process can be best understood with an example. Figure 3 (left) shows a context 
tree of depth two. For expositional reasons, we show binary sequences at the nodes; 
the node probabilities are computed from these. Initially, the binary sequence at each 
node is empty. Suppose 1001 is the history sequence. Setting aside the first two bits 
10 as an initial context, the tree in the middle of Figure 3 shows what we have after 
processing the third bit 0. The tree on the right is the tree we have after processing 
the fourth bit 1. In practice, we of course only have to store the counts of zeros and 
ones instead of complete subsequences at each node because, as we saw earlier in (12), 
$\Pr_{kt}(s) = \Pr_{kt}(a_s, b_s)$. Since the node probabilities are completely determined by the input
sequence, we shall henceforth speak unambiguously about the context tree after seeing a 
sequence. 
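The worked example with history 1001 can be reproduced with a few lines of code. The sketch below records, for each node of a depth-D context tree, the bits that pass through it; the convention of naming a node by its path from the root (most recent context bit first) is an assumption of this illustration.

```python
from collections import defaultdict


def context_tree_sequences(bits, D):
    """Map each visited node (path from the root) to the subsequence of bits observed there."""
    seqs = defaultdict(list)
    for t in range(D, len(bits)):  # the first D bits are only used as context
        for d in range(D + 1):
            # Path of length d: the d most recent bits before position t, newest first.
            path = tuple(bits[t - 1 - k] for k in range(d))
            seqs[path].append(bits[t])
    return dict(seqs)
```

Each observed bit is appended at exactly D + 1 nodes, one per depth, which is why a context tree update costs time linear in D.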

Figure 3: A depth-2 context tree (left); trees after processing two bits (middle and right)

The context tree of depth $D$ after seeing a sequence $h$ has the following important
properties:

1. the model of every PST of depth at most $D$ can be obtained from the context tree
by pruning off appropriate subtrees and treating them as leaf nodes;

2. the block probability of $h$ as computed by each PST of depth at most $D$ can be
obtained from the node probabilities of the context tree via Equation (15).

These properties, together with an application of the distributive law, form the basis of the 
highly efficient (action-conditional) CTW algorithm. We now formalise these insights. 

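Weighted probabilities. We first need to define the weighted probabilities at each node
of the context tree. Suppose $a_{1:t}$ is the action sequence and $x_{1:t}$ is the observation-reward
sequence. Let $\llbracket x_{1:t} \rrbracket|_n$ be the (non-contiguous) subsequence of $\llbracket x_{1:t} \rrbracket$ that ended up in node
$n$ of the context tree. The weighted probability $P_w^n$ of each node $n$ in the context tree is
defined inductively as follows:

$$P_w^n(\llbracket x_{1:t} \rrbracket|_n \mid \llbracket a_{1:t} \rrbracket) :=
\begin{cases}
\Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) & \text{if } n \text{ is a leaf node} \\
\frac{1}{2} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) + \frac{1}{2} P_w^{n_l}(\llbracket x_{1:t} \rrbracket|_{n_l} \mid \llbracket a_{1:t} \rrbracket) \, P_w^{n_r}(\llbracket x_{1:t} \rrbracket|_{n_r} \mid \llbracket a_{1:t} \rrbracket) & \text{otherwise,}
\end{cases}$$

where $n_l$ and $n_r$ are the left and right children of $n$ respectively. Note that the set of
sequences $\{ \llbracket x_{1:t} \rrbracket|_n : n \text{ is a node in the context tree} \}$ has a dependence on the action
sequence $\llbracket a_{1:t} \rrbracket$.

If $n$ is a node at depth $d$ in a tree, we denote by $p(n) \in \{0, 1\}^d$ the path description to
node $n$ in the tree.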

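Lemma 1 ([WST95]). Let $D$ be the depth of the context tree. For each node $n$ in the
context tree at depth $d$, we have for all $a_{1:t} \in \mathcal{A}^t$, for all $x_{1:t} \in \mathcal{X}^t$,

$$P_w^n(\llbracket x_{1:t} \rrbracket|_n \mid \llbracket a_{1:t} \rrbracket) = \sum_{M \in C_{D-d}} 2^{-\Gamma_{D-d}(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n)p(l)}), \tag{16}$$

where $\llbracket x_{1:t} \rrbracket|_{p(n)p(l)}$ is the (non-contiguous) subsequence of $\llbracket x_{1:t} \rrbracket$ that ended up in the node
with path description $p(n)p(l)$ in the context tree.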

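Proof. The proof proceeds by induction on $d$. The statement is clearly true for the leaf
nodes at depth $D$. Assume now the statement is true for all nodes at depth $d + 1$, where
$0 \leq d < D$. Consider a node $n$ at depth $d$. Letting $\bar{d} = D - d$, we have

\begin{align*}
& P_w^n(\llbracket x_{1:t} \rrbracket|_n \mid \llbracket a_{1:t} \rrbracket) \\
&= \tfrac{1}{2} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) + \tfrac{1}{2} P_w^{n_l}(\llbracket x_{1:t} \rrbracket|_{n_l} \mid \llbracket a_{1:t} \rrbracket) \, P_w^{n_r}(\llbracket x_{1:t} \rrbracket|_{n_r} \mid \llbracket a_{1:t} \rrbracket) \\
&= \tfrac{1}{2} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) + \tfrac{1}{2} \Bigg[ \sum_{M \in C_{\bar{d}-1}} 2^{-\Gamma_{\bar{d}-1}(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n_l)p(l)}) \Bigg] \Bigg[ \sum_{M \in C_{\bar{d}-1}} 2^{-\Gamma_{\bar{d}-1}(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n_r)p(l)}) \Bigg] \\
&= \tfrac{1}{2} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) + \sum_{M_1 \in C_{\bar{d}-1}} \sum_{M_2 \in C_{\bar{d}-1}} 2^{-(\Gamma_{\bar{d}-1}(M_1) + \Gamma_{\bar{d}-1}(M_2) + 1)} \Bigg[ \prod_{l \in L(M_1)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n_l)p(l)}) \Bigg] \Bigg[ \prod_{l \in L(M_2)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n_r)p(l)}) \Bigg] \\
&= \tfrac{1}{2} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_n) + \sum_{M_1 M_2 \in C_{\bar{d}}} 2^{-\Gamma_{\bar{d}}(M_1 M_2)} \prod_{l \in L(M_1 M_2)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n)p(l)}) \\
&= \sum_{M \in C_{D-d}} 2^{-\Gamma_{D-d}(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(n)p(l)}),
\end{align*}

where $M_1 M_2$ denotes the tree in $C_{\bar{d}}$ whose left and right subtrees are $M_1$ and $M_2$
respectively. □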

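CTW as an optimal Bayesian mixture predictor. A corollary of Lemma 1 is that at
the root node $\lambda$ of the context tree we have

\begin{align}
P_w^\lambda(\llbracket x_{1:t} \rrbracket \mid \llbracket a_{1:t} \rrbracket) &= \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_{p(l)}) \tag{17} \\
&= \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(\llbracket x_{1:t} \rrbracket|_l) \tag{18} \\
&= \sum_{M \in C_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t} \mid M, a_{1:t}), \tag{19}
\end{align}

where the last step follows from Equation (15). Note carefully that $\llbracket x_{1:t} \rrbracket|_{p(l)}$ in line (17)
denotes the subsequence of $\llbracket x_{1:t} \rrbracket$ that ended in the node pointed to by $p(l)$ in the context
tree but $\llbracket x_{1:t} \rrbracket|_l$ in line (18) denotes the subsequence of $\llbracket x_{1:t} \rrbracket$ that ended in the leaf node $l$ in
$M$ if $M$ is used as the only model to process $\llbracket x_{1:t} \rrbracket$. Equation (19) shows that the quantity
computed by the (action-conditional) CTW algorithm is exactly a Bayesian mixture of
(action-conditional) PSTs.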
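For small depths, the identity between the recursively defined weighted probability and the explicit mixture over all PST models can be verified by brute force. The sketch below does so for a plain binary sequence (no actions, an assumption made to keep it short), with nodes named by their path from the root; all function names are illustrative.

```python
from fractions import Fraction


def kt_block(seq):
    """KT block probability of a bit sequence."""
    a = b = 0
    p = Fraction(1)
    for y in seq:
        p *= Fraction(2 * (b if y else a) + 1, 2 * (a + b + 1))
        a, b = a + (y == 0), b + (y == 1)
    return p


def seq_at(bits, D, path):
    """Bits (after the first D context bits) whose preceding context matches `path`."""
    return [bits[t] for t in range(D, len(bits))
            if all(bits[t - 1 - k] == path[k] for k in range(len(path)))]


def weighted(bits, D, path=()):
    """Weighted probability at node `path`, following the inductive definition."""
    kt = kt_block(seq_at(bits, D, path))
    if len(path) == D:
        return kt
    return (kt + weighted(bits, D, path + (1,)) * weighted(bits, D, path + (0,))) / 2


def models(D, path=()):
    """Yield (leaf paths, cost) for every PST model of depth at most D rooted at `path`."""
    if len(path) == D:
        yield [path], 0  # leaf at maximum depth costs nothing
        return
    yield [path], 1      # leaf above maximum depth costs one bit
    for leaves1, c1 in models(D, path + (1,)):
        for leaves0, c0 in models(D, path + (0,)):
            yield leaves1 + leaves0, 1 + c1 + c0


def mixture(bits, D):
    """Explicit prior-weighted sum over all PST models, as in the corollary."""
    total = Fraction(0)
    for leaves, c in models(D):
        p = Fraction(1)
        for leaf in leaves:
            p *= kt_block(seq_at(bits, D, leaf))
        total += Fraction(1, 2 ** c) * p
    return total
```

The recursion visits each context tree node once, whereas the explicit sum enumerates a number of models that grows doubly exponentially in D; this is the saving provided by the distributive law.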

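The weighted probability $P_w^\lambda$ is a block probability. To recover the conditional
probability of $x_t$ given $ax_{<t}a_t$, we simply evaluate

$$P_w^\lambda(\llbracket x_t \rrbracket \mid \llbracket ax_{<t}a_t \rrbracket) = \frac{P_w^\lambda(\llbracket x_{1:t} \rrbracket \mid \llbracket a_{1:t} \rrbracket)}{P_w^\lambda(\llbracket x_{<t} \rrbracket \mid \llbracket a_{<t} \rrbracket)},$$

which follows directly from Equation (3). To sample from this conditional probability,
we simply sample the individual bits of $x_t$ one by one. For brevity, we will sometimes use
the following notation for $P_w^\lambda$:

\begin{align*}
\Upsilon(x_{1:t} \mid a_{1:t}) &:= P_w^\lambda(\llbracket x_{1:t} \rrbracket \mid \llbracket a_{1:t} \rrbracket) \\
\Upsilon(x_t \mid ax_{<t}a_t) &:= P_w^\lambda(\llbracket x_t \rrbracket \mid \llbracket ax_{<t}a_t \rrbracket).
\end{align*}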

In summary, to do action-conditional prediction using a context tree, we set aside a 
sufficiently long initial portion of the binary history sequence corresponding to the first 
few cycles to initialise the variable h and then repeat the following steps as long as needed: 

1. set $h := h\llbracket a \rrbracket$, where $a$ is the current selected action;

2. for $i := 1$ to $l_\mathcal{X}$ do

(a) predict the next bit using the weighted probability $P_w^\lambda$;

(b) observe the next bit $x[i]$, update the context tree using $h$ and $x[i]$, calculate the
new weighted probability $P_w^\lambda$, and then set $h := hx[i]$.

Note that in practice, the context tree need only be constructed incrementally as needed.
The depth of the context tree can thus take on non-trivial values. The memory requirement
of maintaining a context tree is discussed further in Section 7.

Reversing an update. As explained in Section 3, the Revert operation is performed
many times during search and it needs to be efficient. Saving and restoring a copy of the
context tree is unsatisfactory. Luckily, the block probability estimated by CTW using a
context depth of $D$ at time $t$ can be recovered from the block probability estimated at time
$t + m$ in $O(mD)$ operations in a rather straightforward way. Alternatively, a copy-on-write
implementation can be used to modify the context tree during the simulation phase.
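One way to realise an incremental update together with its inverse is sketched below. It stores counts, the KT block probability and the weighted probability at each node, creates nodes lazily, and undoes an update by inverting Equations (13)-(14) along the same path. The class layout, the method names and the use of exact fractions are assumptions of this sketch, not a description of the authors' implementation.

```python
from fractions import Fraction


class Node:
    def __init__(self):
        self.a = 0             # zeroes observed in this context
        self.b = 0             # ones observed in this context
        self.kt = Fraction(1)  # KT block probability Pr_kt(a, b)
        self.pw = Fraction(1)  # weighted probability of this node
        self.child = {}        # context bit -> Node, created lazily


class ContextTree:
    def __init__(self, depth, initial_context):
        assert len(initial_context) >= depth
        self.depth = depth
        self.root = Node()
        self.history = list(initial_context)

    def _path(self):
        """Nodes from the root to depth D selected by the most recent history bits."""
        nodes = [self.root]
        for k in range(self.depth):
            bit = self.history[-1 - k]
            nodes.append(nodes[-1].child.setdefault(bit, Node()))
        return nodes

    def _reweigh(self, nodes):
        """Recompute weighted probabilities bottom-up along one path."""
        for d in range(self.depth, -1, -1):
            node = nodes[d]
            if d == self.depth:
                node.pw = node.kt
            else:
                prod = Fraction(1)  # children not yet created contribute a factor of 1
                for c in node.child.values():
                    prod *= c.pw
                node.pw = (node.kt + prod) / 2

    def append_action(self, bits):
        self.history.extend(bits)  # action bits: no KT update

    def revert_action(self, nbits):
        del self.history[len(self.history) - nbits:]

    def update(self, bit):
        nodes = self._path()
        for node in nodes:
            count = node.b if bit else node.a
            node.kt *= Fraction(2 * count + 1, 2 * (node.a + node.b + 1))
            if bit:
                node.b += 1
            else:
                node.a += 1
        self._reweigh(nodes)
        self.history.append(bit)

    def revert(self):
        """Undo the most recent update in O(D) operations."""
        bit = self.history.pop()
        nodes = self._path()
        for node in nodes:
            if bit:
                node.b -= 1
            else:
                node.a -= 1
            count = node.b if bit else node.a
            node.kt /= Fraction(2 * count + 1, 2 * (node.a + node.b + 1))
        self._reweigh(nodes)

    def prob(self, bit):
        """Conditional probability of the next bit, as a ratio of block probabilities."""
        before = self.root.pw
        self.update(bit)
        after = self.root.pw
        self.revert()
        return after / before
```

Reverting m bits costs m calls to `revert`, each touching D + 1 nodes, which is consistent with the O(mD) figure quoted above. A practical implementation would likely work with floating-point log-probabilities rather than exact fractions.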

Predicate CTW. As foreshadowed in [Bun92, HS97], the CTW algorithm can be generalised
to work with rich logical tree models [BD98, KW01, Llo03, Ng05, LN07] in
place of prediction suffix trees. A full description of this extension, especially the part on
predicate definition/enumeration and search, is beyond the scope of the paper and will be
reported elsewhere. Here we outline the main ideas and point out how the extension can
be used to incorporate useful background knowledge into our agent.

Definition 8. Let $\mathcal{P} = \{p_0, p_1, \ldots, p_m\}$ be a set of predicates (boolean functions) on histories
$h \in (\mathcal{A} \times \mathcal{X})^n$, $n \geq 0$. A $\mathcal{P}$-model is a binary tree where each internal node is labelled
with a predicate in $\mathcal{P}$ and the left and right outgoing edges at the node are labelled True
and False respectively. A $\mathcal{P}$-tree is a pair $(M_\mathcal{P}, \Theta)$ where $M_\mathcal{P}$ is a $\mathcal{P}$-model and associated
with each leaf node $l$ in $M_\mathcal{P}$ is a probability distribution over $\{0, 1\}$ parameterised by
$\theta_l \in \Theta$.

A $\mathcal{P}$-tree $(M_\mathcal{P}, \Theta)$ represents a function $g$ from histories to probability distributions on
$\{0, 1\}$ in the usual way. For each history $h$, $g(h) = \theta_{l_h}$, where $l_h$ is the leaf node reached
by pushing $h$ down the model $M_\mathcal{P}$ according to whether it satisfies the predicates at the
internal nodes and $\theta_{l_h} \in \Theta$ is the distribution at $l_h$.
The use of general predicates on histories in $\mathcal{P}$-trees is a powerful way of extending
the notion of a "context" in applications. To begin with, it is easy to see that, with a
suitable choice of predicate class $\mathcal{P}$, both prediction suffix trees (Definition 6) and looping
suffix trees [HJ06] can be represented as $\mathcal{P}$-trees. Much more background contextual
information can be provided in this way to the agent to aid learning and action selection.

The following is a generalisation of Definition 7. 

Definition 9. Let $\mathcal{P} = \{p_0, p_1, \ldots, p_m\}$ be a set of predicates on histories. A $\mathcal{P}$-context
tree is a perfect binary tree of depth $m + 1$ where

1. each internal node at depth $i$ is labelled by $p_i \in \mathcal{P}$ and the left and right outgoing
edges at the node are labelled True and False respectively; and

2. attached to each node (both internal and leaf) is a probability on $\{0, 1\}^*$.

We remark here that the (action-conditional) CTW algorithm can be generalised to
work with $\mathcal{P}$-context trees in a natural way, and that a result analogous to Lemma 1 but
with respect to a much richer class of models can be established. A proof of a similar
result is in [HS97]. Section 7 describes some experiments showing how predicate CTW
can help in more difficult problems.

5 Putting it All Together 

We now describe how the entire agent is constructed. At a high level, the combination
is simple. The agent uses the action-conditional (predicate) CTW predictor presented
in Section 4 as a model $\Upsilon$ of the (unknown) environment. At each time step, the agent
first invokes the Predictive UCT routine to estimate the value of each action. The agent
then picks an action according to some standard exploration/exploitation strategy, such
as $\epsilon$-Greedy or Softmax [SB98]. It then receives an observation-reward pair from the
environment which is then used to update $\Upsilon$. Communication between the agent and
the environment is done via binary codings of action, observation, and reward symbols.
Figure 4 gives an overview of the agent/environment interaction loop.
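The interaction loop can be summarised in a few lines. In the sketch below, `estimate_values`, `env_step` and `model_update` are hypothetical placeholders for the planner, the environment and the model update respectively; they are supplied by the caller and are not defined by the text. Only the epsilon-greedy choice is spelled out.

```python
import random


def epsilon_greedy(values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(values))
    return max(values, key=values.get)


def run_agent(env_step, model_update, estimate_values, cycles, epsilon, rng):
    """Run the agent/environment loop and return the average reward per cycle.

    estimate_values() -> dict mapping each action to an estimated value (planner stand-in)
    env_step(action)  -> (observation, reward) from the environment
    model_update(action, observation, reward) updates the environment model
    """
    total = 0.0
    for _ in range(cycles):
        values = estimate_values()
        action = epsilon_greedy(values, epsilon, rng)
        observation, reward = env_step(action)
        model_update(action, observation, reward)
        total += reward
    return total / cycles
```

In a full agent the three callables would share state (the history and the context tree); passing them in as closures keeps the loop itself independent of how the model and planner are implemented.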

It is worth noting that, in principle, the AIXI agent does not need to explore according 
to any heuristic policy. This is because the value of information, in terms of expected future
reward, is implicitly captured in the expectimax operation defined in Equations (1) and 
(2). Theoretically, ignoring all computational concerns, it is sufficient just to choose a 
large horizon and pick the action with the highest expected value at each timestep. 

Unfortunately, this result does not carry over to our approximate AIXI agent. In prac- 
tice, the true environment will not be contained in our restricted model class, nor will 
we perform enough Predictive UCT simulations to converge to the optimal expectimax 
action, nor will the search horizon be as large as the agent's maximal lifespan. Thus, the 
exploration/exploitation dilemma is a non-trivial problem for our agent. We found that the 
standard heuristic solutions to this problem, such as $\epsilon$-Greedy and Softmax exploration,
were sufficient for obtaining good empirical results. We will revisit this issue in Section 7. 
Figure 4: The AIXI-MC agent loop

6 Theoretical Results 

Some theoretical properties of our algorithm are now explored. 



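Model class approximation. We first study the relationship between $\Upsilon$ and the universal
predictor in AIXI. Using $\Upsilon$ in place of $\rho$ in Equation (6), the optimal action for an
agent at time $t$, having experienced $ax_{1:t-1}$, is given by

\begin{align}
a_t^* &= \arg\max_{a_t} \sum_{x_t} \frac{\Upsilon(x_{1:t} \mid a_{1:t})}{\Upsilon(x_{<t} \mid a_{<t})} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \frac{\Upsilon(x_{1:t+m} \mid a_{1:t+m})}{\Upsilon(x_{<t+m} \mid a_{<t+m})} \left[ \sum_{i=t}^{t+m} r_i \right] \nonumber \\
&= \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \prod_{i=t}^{t+m} \frac{\Upsilon(x_{1:i} \mid a_{1:i})}{\Upsilon(x_{<i} \mid a_{<i})} \nonumber \\
&= \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \frac{\Upsilon(x_{1:t+m} \mid a_{1:t+m})}{\Upsilon(x_{<t} \mid a_{<t})} \nonumber \\
&= \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \Upsilon(x_{1:t+m} \mid a_{1:t+m}) \nonumber \\
&= \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{M \in C_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t+m} \mid M, a_{1:t+m}). \tag{20}
\end{align}

Contrast (20) now with Equation (2) which we reproduce here:

$$a_t = \arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \rho(x_{1:t+m} \mid a_{1:t+m}), \tag{21}$$

where $\mathcal{M}$ is the class of all enumerable chronological semimeasures, and $K(\rho)$ denotes
the Kolmogorov complexity of $\rho$ [Hut05]. The two expressions share a prior that enforces
a bias towards simpler models. The main difference is in the subexpression describing
the mixture over the model class. AIXI uses a mixture over all enumerable chronological
semimeasures. This is scaled down to a mixture of prediction suffix trees in our setting.
Although the model class used in AIXI is completely general, it is also incomputable. Our
approximation has restricted the model class to gain the desirable computational properties
of CTW. As indicated in Section 4, the model class $C_D$ can be significantly enlarged
by using predicates without sacrificing the efficient computability of mixtures.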

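Convergence to true environment. We show in this section that if there is a good
model of the (unknown) environment in the class $C_D$, then CTW will 'find' it. We need
the following entropy inequality.

Lemma 2 ([Hut05]). Let $\{y_i\}$ and $\{z_i\}$ be two probability distributions, i.e. $y_i \geq 0$, $z_i \geq 0$,
and $\sum_i y_i = \sum_i z_i = 1$. Then we have

$$\sum_i (y_i - z_i)^2 \leq \sum_i y_i \ln \frac{y_i}{z_i}.$$

Theorem 1. Let $\mu$ be the true environment model. The $\mu$-expected squared difference of
$\mu$ and $\Upsilon$ is bounded as follows. For all $n \in \mathbb{N}$, for all $a_{1:n}$,

$$\sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{<k} \mid a_{<k}) \Big( \mu(x_k \mid ax_{<k}a_k) - \Upsilon(x_k \mid ax_{<k}a_k) \Big)^2 \leq \min_{M \in C_D} \Big\{ \Gamma_D(M) \ln 2 + D_{KL}\big(\mu(\cdot \mid a_{1:n}) \,\|\, \Pr(\cdot \mid M, a_{1:n})\big) \Big\},$$

where $D_{KL}(\cdot \,\|\, \cdot)$ is the KL divergence of two distributions.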



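Proof. We adapt a proof from [Hut05, §5.1.3].

\begin{align*}
& \sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{<k} \mid a_{<k}) \Big( \mu(x_k \mid ax_{<k}a_k) - \Upsilon(x_k \mid ax_{<k}a_k) \Big)^2 \\
&= \sum_{k=1}^{n} \sum_{x_{<k}} \mu(x_{<k} \mid a_{<k}) \sum_{x_k} \Big( \mu(x_k \mid ax_{<k}a_k) - \Upsilon(x_k \mid ax_{<k}a_k) \Big)^2 \\
&\leq \sum_{k=1}^{n} \sum_{x_{<k}} \mu(x_{<k} \mid a_{<k}) \sum_{x_k} \mu(x_k \mid ax_{<k}a_k) \ln \frac{\mu(x_k \mid ax_{<k}a_k)}{\Upsilon(x_k \mid ax_{<k}a_k)} && \text{[by Lemma 2]} \\
&= \sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{1:k} \mid a_{1:k}) \ln \frac{\mu(x_k \mid ax_{<k}a_k)}{\Upsilon(x_k \mid ax_{<k}a_k)} \\
&= \sum_{k=1}^{n} \sum_{x_{1:k}} \Big( \sum_{x_{k+1:n}} \mu(x_{1:n} \mid a_{1:n}) \Big) \ln \frac{\mu(x_k \mid ax_{<k}a_k)}{\Upsilon(x_k \mid ax_{<k}a_k)} && \text{[by Defn. 2]} \\
&= \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \sum_{k=1}^{n} \ln \frac{\mu(x_k \mid ax_{<k}a_k)}{\Upsilon(x_k \mid ax_{<k}a_k)} \\
&= \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \ln \frac{\mu(x_{1:n} \mid a_{1:n})}{\Upsilon(x_{1:n} \mid a_{1:n})} && \text{[by Eq. (4)]} \\
&= \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \ln \left[ \frac{\mu(x_{1:n} \mid a_{1:n})}{\Pr(x_{1:n} \mid M, a_{1:n})} \cdot \frac{\Pr(x_{1:n} \mid M, a_{1:n})}{\Upsilon(x_{1:n} \mid a_{1:n})} \right] && \text{[arbitrary } M \in C_D \text{]} \\
&= \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \ln \frac{\mu(x_{1:n} \mid a_{1:n})}{\Pr(x_{1:n} \mid M, a_{1:n})} + \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \ln \frac{\Pr(x_{1:n} \mid M, a_{1:n})}{\Upsilon(x_{1:n} \mid a_{1:n})} \\
&\leq D_{KL}\big(\mu(\cdot \mid a_{1:n}) \,\|\, \Pr(\cdot \mid M, a_{1:n})\big) + \sum_{x_{1:n}} \mu(x_{1:n} \mid a_{1:n}) \ln \frac{\Pr(x_{1:n} \mid M, a_{1:n})}{2^{-\Gamma_D(M)} \Pr(x_{1:n} \mid M, a_{1:n})} && \text{[by Eq. (19)]} \\
&= D_{KL}\big(\mu(\cdot \mid a_{1:n}) \,\|\, \Pr(\cdot \mid M, a_{1:n})\big) + \Gamma_D(M) \ln 2.
\end{align*}

Since the inequality holds for arbitrary $M \in C_D$, it holds for the minimising $M$. □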



If the KL divergence between $\mu$ and the best model in $C_D$ is finite, then Theorem 1
implies $\Upsilon(x_k \mid ax_{<k}a_k)$ will converge rapidly to $\mu(x_k \mid ax_{<k}a_k)$ for $k \to \infty$ with $\mu$-probability
1. The contrapositive of the statement tells us that if $\Upsilon$ fails to predict the environment
well, then there is no good model in $C_D$. This result provides the motivation for looking
at ways of enriching the model class in Section 8.
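As a hedged illustration of how the bound of Theorem 1 behaves, suppose (an assumption made here for the sake of the example, not a claim of the paper) that the true environment $\mu$ is itself a PST with model $M \in C_D$ and fixed leaf parameters. The standard redundancy bound for the KT estimator [WST95], $\Pr_{kt}(a, b) \geq \frac{1}{2\sqrt{a+b}} \theta^b (1-\theta)^a$ for $a + b \geq 1$ and any $\theta$, applied at each leaf (each of which receives at most $n \, l_\mathcal{X}$ bits), then gives:

```latex
\begin{align*}
D_{KL}\big(\mu(\cdot \mid a_{1:n}) \,\|\, \Pr(\cdot \mid M, a_{1:n})\big)
  &\leq |L(M)| \left( \ln 2 + \tfrac{1}{2} \ln (n \, l_\mathcal{X}) \right), \\
\sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{<k} \mid a_{<k})
  \Big( \mu(x_k \mid ax_{<k}a_k) - \Upsilon(x_k \mid ax_{<k}a_k) \Big)^2
  &\leq \Gamma_D(M) \ln 2 + |L(M)| \left( \ln 2 + \tfrac{1}{2} \ln (n \, l_\mathcal{X}) \right).
\end{align*}
```

Under this assumption the cumulative expected squared error grows at most logarithmically in $n$, so the average error per cycle tends to zero.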



Consistency of Predictive UCT. Let $\mu$ be the true underlying environment. We now
establish the link between the expectimax value $V_\mu^m(h)$ and its estimate $\hat{V}_\Upsilon^m(h)$ computed
by the Predictive UCT algorithm using $\Upsilon$ as the environment model.

In [KS06], the authors show that the UCT algorithm is consistent in finite horizon
MDPs and derive finite sample bounds on the estimation error due to sampling. By
interpreting histories as Markov states, our general agent problem reduces to a finite horizon
MDP and the results of [KS06] are now directly applicable. Restating the main consistency
result in our notation, we have

$$\forall \epsilon \, \forall h \quad \lim_{T(h) \to \infty} \Pr\big( |V_\Upsilon^m(h) - \hat{V}_\Upsilon^m(h)| \leq \epsilon \big) = 1. \tag{22}$$

Further, the probability that a suboptimal action (with respect to $V_\Upsilon^m(\cdot)$) is picked by
Predictive UCT goes to zero in the limit. Details of this analysis can be found in [KS06].

Theorem 1 above in conjunction with [Hut05, Thm. 5.36] implies $V_\Upsilon^m(h) \to V_\mu^m(h)$, as
long as there exists a model in the model class that approximates the unknown environment
$\mu$ well. This, and the consistency (22) of the Predictive UCT algorithm, imply that
$\hat{V}_\Upsilon^m(h)$ will converge to $V_\mu^m(h)$.
Domain                         Aliasing   Noisy $\mathcal{O}$   Noisy $\mathcal{R}$   Uninformative
1d-maze                        yes        no        no        yes
Cheese Maze                    yes        no        no        no
Tiger                          yes        yes       no        no
Extended Tiger                 yes        yes       no        no
4x4 Grid                       yes        no        no        yes
TicTacToe                      no         no        no        no
Biased Rock-Paper-Scissor      no         yes       yes       no
Partially Observable Pacman    yes        no        no        no

Table 1: Domain characteristics




7 Experimental Results 

In this section we evaluate our algorithm on a number of pre-existing domains. We have 
chosen domains that, from the agent's perspective, have noisy perceptions, partial infor- 
mation, and inherent stochastic elements. In particular, we will focus on learning and 
approximately solving some benchmark POMDPs. The planning problem (i.e. computation
of the optimal policy given the full POMDP model) associated with these POMDPs
was considered challenging in the mid-nineties but can now be solved easily. We stress
here that our requirement of having to learn the environment model, as well as solve the 
planning problem, significantly increases the difficulty of these problems. 

As we shall see, our agent achieves state-of-the-art performance in both generality 
(eight separate problems with different characteristics are attempted) and optimality (the 
agent converges to the optimal policy in seven cases, and exhibits good scaling properties 
in the remaining case). 

Our test domains are now described in detail. Their characteristics are summarised in 
Table 1. 

1d-maze. The 1d-maze is a simple problem from [CKL94]. The agent begins at a random,
non-goal location within a 1 x 4 maze. There is a choice of two actions: left or right.
Each action transfers the agent to the adjacent cell if it exists, otherwise it has no effect.
If the agent reaches the third cell from the left, it receives a reward of 1. Otherwise it
receives a reward of 0. The distinguishing feature of this problem is that the observations
are uninformative; every observation is the same regardless of the agent's actual location.

Cheese maze. This well-known problem is due to [McC96]. The agent is a mouse inside
a two-dimensional maze seeking a piece of cheese. The agent has to choose one of four
actions: move up, down, left or right. If the agent bumps into a wall, it receives a penalty
of -10. If the agent finds the cheese, it receives a reward of 10. Each movement into
a free cell gives a penalty of -1. The problem is depicted graphically in Figure 5. The
number in each cell represents the decimal equivalent of the four-bit binary observation
the mouse receives in each cell. The problem exhibits perceptual aliasing in that a single
observation is potentially ambiguous.

Figure 5: The cheese maze



Tiger. This is another familiar domain from [KLC95]. The environment dynamics are 
as follows: a tiger and a pot of gold are hidden behind one of two doors. Initially the 
agent starts facing both doors. The agent has a choice of one of three actions: listen, open 
the left door, or open the right door. If the agent opens the door hiding the tiger, it suffers 
a -100 penalty. If it opens the door with the pot of gold, it receives a reward of 10. If 
the agent performs the listen action, it receives a penalty of -1 and an observation that 
correctly describes where the tiger is with 0.85 probability. 
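A minimal simulator for this domain might look as follows. The description above does not say what happens after a door is opened or what is observed at that point; the sketch assumes the tiger is re-placed at random and a neutral observation is returned. The action constants and class name are likewise illustrative.

```python
import random

LISTEN, OPEN_LEFT, OPEN_RIGHT = 0, 1, 2


class Tiger:
    """Sketch of the Tiger domain; returns (observation, reward) for each action."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.reset()

    def reset(self):
        self.tiger = self.rng.choice(["left", "right"])

    def step(self, action):
        if action == LISTEN:
            # The observation names the tiger's door correctly with probability 0.85.
            correct = self.rng.random() < 0.85
            other = "right" if self.tiger == "left" else "left"
            return (self.tiger if correct else other), -1
        opened = "left" if action == OPEN_LEFT else "right"
        reward = -100 if opened == self.tiger else 10
        self.reset()           # assumption: a new episode starts after a door is opened
        return "none", reward  # assumption: opening a door yields no informative observation
```

Repeated listening lets an agent sharpen its belief about the tiger's location at a cost of 1 per step, which is the trade-off the domain is designed to test.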

Extended Tiger. The problem setting is similar to Tiger, except that now the agent 
begins sitting down on a chair. The actions available to the agent are: stand, listen, open 
the left door, and open the right door. Before an agent can successfully open one of the 
two doors, it must stand up. However, the listen action only provides information about 
the tiger's whereabouts when the agent is sitting down. Thus it is necessary for the agent 
to plan a more intricate series of actions before it sees the optimal solution. The reward 
structure is slightly modified from the simple Tiger problem, as now the agent gets a 
reward of 30 when finding the pot of gold. 

4x4 Grid. The agent is restricted to a 4 x 4 grid world. It can move either up, down,
right or left. If the agent moves into the bottom right corner, it receives a reward of 1, and
it is randomly teleported to one of the remaining 15 cells. If it moves into any cell other
than the bottom right corner cell, it receives a reward of 0. If the agent attempts to move
into a non-existent cell, it remains in the same location. Like the 1d-maze, this problem
is also uninformative but on a much larger scale. Although this domain is simple, it does
require some subtlety on the part of the agent. The correct action depends on what the
agent has tried before at previous time steps. For example, if the agent has repeatedly
moved right and not received a positive reward, then the chances of it receiving a positive
reward by moving down are increased.

TicTacToe. In this domain, the agent plays repeated games of TicTacToe against an 
opponent who moves randomly. If the agent wins the game, it receives a reward of 2. If 
there is a draw, the agent receives a reward of 1. A loss penalises the agent by -2. If the 
agent makes an illegal move, by moving on top of an already filled square, then it receives 
a reward of -3. A legal move that does not end the game earns no reward. 

Biased Rock-Paper-Scissor. This domain is taken from [FMRW09]. The agent repeatedly
plays Rock-Paper-Scissor against an opponent that has a slight, predictable bias in
its strategy. If the opponent has won a round by playing rock on the previous cycle, it
will always play rock at the next timestep; otherwise it will pick an action uniformly at
random. The agent's observation is the most recently chosen action of the opponent. It
receives a reward of 1 for a win, 0 for a draw and -1 for a loss.
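The opponent's bias is simple enough to state in code. The following sketch is one possible reading of the description; the integer encoding of the moves and the class name are assumptions.

```python
import random

ROCK, PAPER, SCISSORS = 0, 1, 2


def beats(x, y):
    """True if move x beats move y (paper beats rock, scissors beat paper, rock beats scissors)."""
    return (x - y) % 3 == 1


class BiasedRPS:
    """Sketch of the biased opponent; step returns (opponent's move, agent's reward)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.repeat_rock = False  # set when the opponent has just won with rock

    def step(self, action):
        if self.repeat_rock:
            opponent = ROCK
        else:
            opponent = self.rng.choice([ROCK, PAPER, SCISSORS])
        opponent_won = beats(opponent, action)
        self.repeat_rock = opponent_won and opponent == ROCK
        if beats(action, opponent):
            reward = 1
        elif opponent_won:
            reward = -1
        else:
            reward = 0
        return opponent, reward
```

An agent that has learned the bias can play paper immediately after losing to rock and win that round with certainty.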

Partially Observable PacMan. This domain is a partially observable version of the 
classic PacMan game. The agent must navigate a 17 x 17 maze and eat the food pellets 
that are distributed across the maze. Four ghosts roam the maze. They move initially 
at random, until there is a Manhattan distance of 5 between them and PacMan, where- 
upon they will aggressively pursue PacMan for a short duration. The maze structure and 
game are the same as in the original arcade game; however, the PacMan agent is hampered
by partial observability. PacMan is unaware of the maze structure and only receives a 
4-bit observation describing the wall configuration at its current location. It also does 
not know the exact location of the ghosts, receiving only 4-bit observations indicating 
whether a ghost is visible (via direct line of sight) in each of the four cardinal directions. 
In addition, the location of the food pellets is unknown except for a 3-bit observation 
that indicates whether food can be smelt within a Manhattan distance of 2, 3 or 4 from 
PacMan's location, and another 4-bit observation indicating whether there is food in its 
direct line of sight. A final single bit indicates whether PacMan is under the effects of a 
power pill. At the start of each episode, a food pellet is placed down with probability 0.5 
at every empty location on the grid. The agent receives a penalty of 1 for each movement 
action, a penalty of 10 for running into a wall, a reward of 10 for each food pellet eaten, 
a penalty of 50 if it is caught by a ghost, and a reward of 100 for collecting all the food. 
If multiple such events occur, then the total reward is cumulative, i.e. running into a wall 
and being caught would give a penalty of 60. The episode resets if the agent is caught or 
if it collects all the food. 

Figure 6 shows a graphical representation of the partially observable PacMan domain. 
This problem is the largest domain we consider, with an unknown optimal policy. The 
main purpose of this domain is to show the scaling properties of our agent with respect to 
a challenging problem. 
Figure 6: A screenshot (converted to b&w) of the partially observable PacMan domain



Experimental setup. Table 2 outlines the parameters used in each experiment. The 
sizes of the action and observation spaces are given. The $\mathcal{A}$ bits, $\mathcal{O}$ bits and $\mathcal{R}$ bits
parameters specify the number of bits used to encode the action, observation and reward spaces.
The context depth parameter D specifies the maximum number of most recent bits used 
by the action-conditional CTW prediction scheme. The search horizon is specified by the 
parameter m. 

The experimental results are presented in terms of average reward per time step. The 
key factors of interest are the performance of the agent as it accumulates more real world 
experience, and the performance of the agent as it is given more thinking time per deci- 
sion. 

All experiments were performed on a dual quad-core Intel 2.53GHz Xeon. If computational
concerns could be ignored, it would be natural to make $D$ as large as possible
since CTW is robust against overfitting due to its strong bias towards simple PSTs. There 
are similar issues with the choice of horizon; ideally the horizon would be as large as 
possible if we could ignore computational concerns. In practice however, these param- 
eters must be made much smaller for our agent to be tractable on our modest hardware. 
Section 7 discusses the asymptotic properties of our algorithms. Although the asymptotic 
behaviour is excellent (essentially linear in D and m in terms of both time and space), our 
prototype implementation is still pushing the boundaries of what can be done on a present 
day workstation. There are obvious problems if these parameters are set too small. For 
example, if the problem is n-Markov but we only use a D < n, or if the optimal policy 
requires planning ahead more than m steps, then we cannot expect the agent to perform 
optimally. 
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Table 2: Parameter Configuration 



Scaling properties. Our agent has both limited thinking time and a limited amount of 
time to gather experience in the real world. Potentially, both of these dimensions will 
affect the agent's performance. This section explores how the agent's performance on
different problem domains changes as we vary these two parameters.

Figure 7 shows the performance of the agent as it accumulates more experience. Two 
seconds of search time per decision were used for each experiment. The label Age for the
horizontal axis refers to the number of cycles that have elapsed.

Figure 8 shows the performance of the agent on each problem domain by running 
it with varying amounts of search. The environment model used for each experiment 
was learned by the agent from randomly interacting with the environment for 50'000 
timesteps, with the exception of TicTacToe which used a model built from 500'000
timesteps. Random action selection was used for computational reasons; it allowed large 
amounts of experience to be gathered quickly. For each data point, the agent is run for 
2000 timesteps, using the best action chosen greedily by Predictive UCT. The average 
reward is then calculated from the performance across these 2000 timesteps. 

General discussion. In all cases, given sufficient thinking time and experience, the 
performance of our agent approaches optimality. Generally speaking, the agent's per- 
formance gets better as it acquires more experience and is given more search time per 
decision. The agent's performance on the tiger domains warrants some discussion. 

The behaviour of the agent in the Tiger domain varies as the amount of interaction 
with the environment increases. Initially, the agent avoids selecting a door, as it is too 
uncertain about the environment dynamics. However, as it gathers more experience, more 
sophisticated behaviour emerges; the agent correctly acquires multiple pieces of informa- 
tion before picking a door. If some of the information is contradictory, the agent gathers 
more information before making its decision. 

The performance of the agent in the Extended Tiger domain is sensitive to the number 
of simulations used by Predictive UCT. As can be seen in Figure 7, two seconds of thinking
time were insufficient to act optimally. As indicated by Figure 8, optimal behaviour is
only achieved when using a minimum of approximately 10'000 simulations per decision.
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Figure 7: Average reward vs age (measured in number of cycles). Two seconds of search 
were used for each action. 






[Figure 8 panels: Search Scalability for 1d Maze, Cheese Maze, Tiger, 4x4 Grid, Extended
Tiger and TicTacToe. Horizontal axis: Number of Simulations per Timestep. Legend:
Empirical, Optimal.]

Figure 8: Average reward vs search effort (measured in terms of the number of simulations 
used for picking each action). 
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Scaling Properties - Partially Observable Pacman 
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Figure 9: Scaling properties on Partially Observable Pacman 

Only then does the agent understand that it is worth listening initially, then standing up,
and then finally choosing the correct door according to the information it gathered whilst
sitting down. With fewer simulations, the agent avoids picking a door. Interestingly, the
performance of the agent drops after it has interacted with the world 5000 times, yet then 
sharply increases. At 5000 steps, the agent has overcome its aversion towards picking a 
door, without fully understanding the environment dynamics. This causes the agent to 
sometimes pick the wrong door. Further interaction refines the environment model and 
subsequently allows the agent to improve its performance. 



Performance on a challenging domain. Above we introduced the partially observable 
Pacman domain. In contrast to our other domains, this is an enormous problem. Even if 
the underlying state space were known, the learning and planning problems would still be 
hard because there are more than 2^" states. 

Figure 9 shows the scaling properties of our agent. Again, random exploration was 
used to build the model for computational reasons. The average reward at each data point 
was gathered by running the agent for 4000 timesteps, with each action being determined 
by Predictive UCT. 

Visually, the performance of the agent was non-optimal. However, after 2.5 million 
cycles of interaction, the agent had managed to learn a number of important concepts. It 
knows not to run into walls. It knows how to seek out food from the limited information 
provided by its sensors. It knows how to run away and avoid chasing ghosts. The main 
subtlety that it hasn't learnt (after 2.5 million timesteps) is to aggressively chase down 
ghosts when it has eaten a red power pill. Also, its behaviour can sometimes become 
temporarily erratic when stuck in a long corridor with no nearby food or visible ghosts. 
Still, the ability to perform reasonably in a large domain, and exhibit consistent increases 
in performance with additional resources (experience or search time) makes us optimistic 
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about the long-term potential of our work. 

Heuristic playout function. An important parameter in Predictive UCT is the choice of 
the playout function. In MCTS-based methods for playing Computer Go, it is well known 
that adding knowledge to the playout function can dramatically improve performance 
[GWMT06]. One of the benefits of MCTS methods is that if the domain is known, the 
playout function presents a natural way to incorporate domain knowledge. In the general 
agent setting, it would be desirable to automatically gain some of the benefits of expert 
design through online learning. 

If the domain is unknown, a natural baseline playout policy is one that selects between 
each action uniformly at random. Although this playout policy is obviously quite poor, 
it does make some heuristic sense: the playouts end up guiding the search toward areas 
that give off larger rewards without requiring a carefully planned action sequence. In 
Section 3, we described an intuitive method to incrementally learn a playout policy by 
attempting to model the real-world actions chosen by Predictive UCT. The aim of this 
section is to show that our heuristic approach, using a CTW-based action predictor as a 
playout function, can give significant improvements to Predictive UCT over the naive, 
uniformly random policy. 
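
As a rough illustration of the baseline described above, a uniformly random playout can be sketched as follows. This is only a sketch under stated assumptions: `sample_percept` is a hypothetical stand-in for a generative environment model that returns an (observation, reward) pair given the history so far and an action, and the history representation is invented for the example.

```python
import random


def uniform_playout(sample_percept, history, actions, horizon, rng=random):
    """Roll out `horizon` steps with uniformly random actions.

    `sample_percept(history, action)` is a hypothetical model interface
    returning an (observation, reward) pair. Returns the total reward
    accumulated along the simulated trajectory.
    """
    rollout = list(history)  # copy so the caller's history is left untouched
    total = 0.0
    for _ in range(horizon):
        action = rng.choice(actions)  # uniform over the available actions
        observation, reward = sample_percept(rollout, action)
        rollout.append((action, observation, reward))
        total += reward
    return total
```

A learned playout policy would replace the `rng.choice(actions)` line with a draw from an action predictor conditioned on `rollout`; everything else stays the same.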

Figure 10 shows the impact of using the learned playout function on the cheese maze. 
(The other domains we tested exhibit similar behaviour.) Two versions of the same agent 
were run for 120'000 cycles. Actions were selected using an ε-greedy policy: i.e. with
probability ε the agent moved randomly, otherwise the best action according to Predictive
UCT was chosen. The initial ε of 0.9 was decayed by multiplying by 0.999 at each timestep.
A small number (100 or 500) of Predictive UCT simulations was used to decide on each action,
to maximise the impact of the playout policy on the overall agent performance. The agent
that used the self-improving playout policy learned faster and obtained a higher maximum 
average reward than the agent using uniform random playouts. Although the difference in 
average reward is small numerically, there is a qualitative difference in the performance of 
the agent. For example, the uniform playout policy when using 100 simulations averages 
approximately -1 per timestep. This is equivalent to a policy that simply runs around 
the maze, never finding the cheese, without ever bumping into a wall. When using 100 
learned playouts however, the average reward ends up greater than zero. To achieve this, 
the agent must be finding the cheese, on average, in fewer than 11 steps every instance.
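
The exploration schedule and the break-even arithmetic in this paragraph can be checked with a small sketch. The function names are hypothetical, and the cheese maze rewards used in `average_reward_per_step` (-1 per ordinary step, +10 on reaching the cheese) are assumptions taken from the domain description rather than from this paragraph.

```python
import random


def epsilon_schedule(t, eps0=0.9, decay=0.999):
    """Exploration rate after t multiplicative decay steps."""
    return eps0 * decay ** t


def epsilon_greedy(actions, best_action, eps, rng=random):
    """With probability eps pick a random action, otherwise the best one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return best_action


def average_reward_per_step(steps_to_cheese, step_reward=-1, cheese_reward=10):
    """Average reward of an episode that reaches the cheese on its last step.

    The default reward values are assumed, not quoted from this section.
    """
    total = (steps_to_cheese - 1) * step_reward + cheese_reward
    return total / steps_to_cheese
```

Under these assumed rewards the average is exactly zero at 11 steps and positive for anything shorter, which matches the threshold quoted above. The schedule also shows why exploration fades quickly: after 1000 cycles the exploration rate has fallen to roughly a third.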

Our results demonstrate that it is both reasonable and practical for an MCTS-based
general reinforcement learning agent to attempt to learn a playout function online. Our 
results are by no means exhaustive. The ideal action predictor may not resemble the 
observation/reward predictor, or it may be designed with different speed/accuracy trade- 
offs in mind. Online learning of playout functions for MCTS-based agents is a promising 
direction for future research. Building on this idea, one could also look at ways to modify 
the UCB policy used in Predictive UCT to automatically take advantage of learnt playout 
knowledge, similar to the heuristic techniques used in Computer Go [GS07]. 
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Figure 10: Impact of learned playout function on performance 



Computational considerations. If an agent has interacted with the world for t cycles,
using a context tree with depth D, there are at most $O(tD\log(|\mathcal{O}||\mathcal{R}|))$ nodes in the context
tree. In practice, unless the environment is very noisy, only a subset of the $2^D$ possible
contexts will be created. In our experiments, no more than a gigabyte of memory
was required to store the entire environment model. The time complexity of CTW is
also impressive: $O(D)$ to generate a single bit, and $O(Dm\log(|\mathcal{O}||\mathcal{R}|))$ to generate the m
observation/reward pairs needed to perform a single Predictive UCT simulation.
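
To give a feel for these bounds, the following sketch evaluates them with all hidden constants set to 1. It is an order-of-magnitude aid only: the helper names are hypothetical, and it assumes a percept is encoded with the minimum number of bits, which need not match the encodings actually used in the experiments.

```python
def percept_bits(num_observations, num_rewards):
    """Minimum bits to encode one observation/reward pair (assumed encoding)."""
    n = num_observations * num_rewards
    return max(1, (n - 1).bit_length())  # integer form of ceil(log2(n))


def max_context_tree_nodes(t, depth, num_observations, num_rewards):
    """Bound t * D * log(|O||R|) on context tree nodes, constants dropped."""
    return t * depth * percept_bits(num_observations, num_rewards)


def node_updates_per_simulation(depth, horizon, num_observations, num_rewards):
    """Cost D * m * log(|O||R|) of one simulation, constants dropped."""
    return depth * horizon * percept_bits(num_observations, num_rewards)
```

For example, with 2 observations, 4 rewards, D = 10 and t = 1000 cycles, the node bound evaluates to 30'000, and a simulation with horizon m = 5 touches on the order of 150 nodes.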



Predicate CTW. This section gives an example of how Predicate CTW can be used to 
incorporate domain knowledge that drastically simplifies the agent's learning task. We 
saw earlier in Figure 7 that the dynamics of TicTacToe required a large amount of training 
examples for CTW to correctly predict the environment dynamics. Essentially, the main 
difficulty for the first hundred thousand steps was avoiding making illegal moves. In this 
experiment, the set of predicates that define CTW was augmented with a predicate that 
indicated whether the last move by the agent was legal. As one would expect, the agent 
using this augmented predicate set quickly learnt to play according to the game rules. 
Figure 11 shows how a small but carefully chosen piece of domain knowledge can have a
significant impact on the agent's performance. 
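
A predicate of the kind described here can be sketched as a boolean function on the agent's history. The encoding below is entirely hypothetical: it assumes the history is a list of (action, observation) pairs, that an action is a cell index from 0 to 8, and that each observation is the board after the move as a 9-tuple. It also ignores episode resets for brevity.

```python
EMPTY, AGENT, OPPONENT = 0, 1, 2
INITIAL_BOARD = (EMPTY,) * 9


def last_move_was_legal(history):
    """Predicate: did the agent's most recent action target an empty cell?

    `history` is a list of (action, observation) pairs under the assumed
    encoding; the board before the last action is the previous observation.
    """
    if not history:
        return True  # no move made yet (a convention chosen for this sketch)
    last_action, _ = history[-1]
    board_before = history[-2][1] if len(history) >= 2 else INITIAL_BOARD
    return board_before[last_action] == EMPTY


def predicate_context(history, predicates):
    """Evaluate each predicate on the history to form a binary context."""
    return tuple(int(p(history)) for p in predicates)
```

The point of the sketch is that the context handed to the predictor is built from predicate values rather than from the raw most recent bits, so a single well-chosen predicate can expose a regularity that would otherwise need a deep context to capture.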
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Predicate CTW versus CTW on TicTacToe 




Figure 11: Impact of domain knowledge, using 1000 Predictive UCT simulations. 

8 Discussion 

We discuss some related and future work in this section. The headings reflect the general
area of the literature in which that work can be found.



Algorithmic Information Theory. There have been several attempts at studying the 
computational properties of AIXI. In [Hut02], an asymptotically optimal algorithm is 
proposed that, in parallel, picks and runs the fastest program from an enumeration of 
provably correct programs for any given well-defined problem. A similar construction 
that runs all programs of length less than l and time less than t per cycle and picks the best
output (in the sense of maximizing a provable lower bound for the true value) results in
the optimal time bounded AIXItl agent [Hut05, Chp.7]. Like Levin search [Lev73], such
algorithms are not practical in general but can in some cases be applied successfully; see 
e.g. [Sch97, SZW97, Sch03, Sch04]. 

In tiny domains, universal learning is computationally feasible with brute-force 
search. In [PH06] the behaviour of AIXI is compared with a universal
predicting-with-expert-advice algorithm [PH05] in repeated 2x2 matrix games, and the two
are shown to exhibit different behaviour.

A Monte Carlo algorithm is proposed in [Pan08] that samples programs according 
to their algorithmic probability as a way of approximating Solomonoff's universal prior. 
A closely related algorithm is that of speed prior sampling [Sch02]. It remains an open 
question whether algorithms that sample from the space of general Turing machines can 
be made to work in practical problems. 

General Reinforcement Learning. We move on next to a discussion of related work 
in the general RL literature. An early and influential work is the Utile Suffix Memory 
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(USM) algorithm by McCallum [McC96]. USM uses a suffix tree to partition the agent's 
history space into distinct states, one for each leaf in the suffix tree. Associated with each 
state/leaf is a Q-value, which is updated incrementally from experience as in Q-learning
[WD92]. The history-partitioning suffix tree is grown in an incremental fashion, starting
from a single leaf node in the beginning. A leaf in the suffix tree is split when the history
sequences that fall into the leaf are shown to exhibit statistically different Q-values. The
USM algorithm works well for a number of tasks but could not deal effectively with
noisy environments. Several extensions of USM to deal with noisy environments are
investigated in [SB04, Sha07]. USM and its extensions are usually well-motivated but
lack formal performance guarantees. 
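
The idea of treating suffix tree leaves as states with attached Q-values can be sketched as below. This is a simplified illustration, not McCallum's USM: the set of leaves is fixed and supplied by the caller, the statistical leaf-splitting test is omitted, and the class name, parameters and history encoding are all hypothetical.

```python
from collections import defaultdict


class SuffixStateQ:
    """Q-learning over states defined by matching history suffixes."""

    def __init__(self, leaves, alpha=0.1, gamma=0.9):
        # Try longer suffixes first so the most specific leaf wins.
        self.leaves = sorted((tuple(l) for l in leaves), key=len, reverse=True)
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)  # maps (state, action) to a Q-value

    def state(self, history):
        """Return the leaf (suffix) that the history falls into."""
        n = len(history)
        for leaf in self.leaves:
            if len(leaf) <= n and tuple(history[n - len(leaf):]) == leaf:
                return leaf
        return ()  # fall back to the root if no leaf matches

    def update(self, history, action, reward, next_history, actions):
        """One Q-learning backup between the two suffix-defined states."""
        s, s2 = self.state(history), self.state(next_history)
        target = reward + self.gamma * max(self.q[(s2, a)] for a in actions)
        self.q[(s, action)] += self.alpha * (target - self.q[(s, action)])
```

In the full algorithm the leaf set would grow over time, with a leaf being split whenever the histories it groups together show statistically different returns.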

The work closest to ours in the general RL literature is the BLHT algorithm described 
in [SHL97, SH99]. As in the present work, Suematsu et al. use prediction suffix trees 
as the model class but their suffix trees are defined at the symbol level (like in USM) 
as opposed to the bit level at which we operate. Another difference is that BLHT uses 
the maximum a posteriori (MAP) model to predict the future at any one time whereas 
we use a mixture of models. Having said that, the actual data structure and algorithm 
used in [SHL97, SH99] to efficiently compute the MAP model bears close resemblance 
to CTW, and their algorithm may indeed be a general form of the context tree maximising 
algorithm [VW95]. In their experiments, Suematsu et al. chose to use a uniform prior 
over the tree models even though their algorithm would work with an Ockham prior like 
that given in Equation (20). It is also worth noting that our use of a Bayesian mixture 
admits a much stronger convergence result compared to what can be proved for BLHT. 
For control, BLHT uses an (unspecified) dynamic programming based algorithm. 
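
The difference between predicting with the MAP model and predicting with a Bayesian mixture can be made concrete with a toy example. The sketch below enumerates a handful of models explicitly, which is purely illustrative: CTW obtains the mixture without enumeration, and the function names and numbers here are invented for the example.

```python
def posterior(priors, likelihoods):
    """Posterior weights over models from priors and data likelihoods."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]


def map_prediction(post, predictions):
    """Predict with the single most probable model only."""
    best = max(range(len(post)), key=post.__getitem__)
    return predictions[best]


def mixture_prediction(post, predictions):
    """Predict with the posterior-weighted average of all models."""
    return sum(w * p for w, p in zip(post, predictions))
```

With two equally likely models a priori, likelihoods 0.6 and 0.4, and next-bit predictions 0.9 and 0.1, the MAP prediction is 0.9 while the mixture gives 0.58. The mixture hedges across models in proportion to the evidence, whereas MAP commits fully to the current leader.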

The active LZ algorithm [FMRW09] is also similar in spirit to our work. It combines 
a Lempel-Ziv [ZL77] based prediction scheme with dynamic programming for control 
to produce an agent that is provably optimal if the environment is n-Markov, for some 
arbitrary n. They introduced and evaluated the performance of their agent on the (n- 
Markov) biased Rock-Paper-Scissor domain. We ran our agent on the same domain, using 
action-conditional CTW, 10000 Predictive UCT simulations and a uniform playout policy.
Figure 12 shows our results overlaid with their reported results. Though it is difficult
to compare implementations, it is clear that our agent has reached optimal performance 
using vastly less (at least two orders of magnitude) experience. 

Predictive state representations (PSRs) [LSS02, SJR04] maintain predictions of future 
experience. Formally, a PSR is a probability distribution over the agent's future experience,
given its past experience. A subset of these predictions, the core tests, provide a
sufficient statistic for all future experience. PSRs provide a Markov state representation,
can represent and track the agent's state in partially observable environments, and provide
a complete model of the world's dynamics. Unfortunately, exact representations of state
are impractical in large domains, and some form of approximation is typically required. 
There is considerable interest in PSRs but there are at present still no satisfactory learning 
and discovery algorithms for PSRs. 

Temporal-difference networks [ST04] are a form of predictive state representation in
which the agent's state is approximated by abstract predictions. These can be predictions
about future observations, but also predictions about future predictions. This set of
interconnected predictions is known as the question network. Temporal-difference networks
learn an approximate model of the world's dynamics: given the current predictions,
the agent's action, and an observation vector, they provide new predictions for the next
time-step. The parameters of the model, known as the answer network, are updated after
each time-step by temporal-difference learning. Some promising recent results applying
TD-Networks for prediction (but not control) to small POMDPs have been reported in
[Mak09].

Figure 12: Comparison between AIXI-MC (using action-conditional CTW, 10k Predictive
UCT simulations and uniform playouts) and the Active-LZ algorithm on biased
Rock-Paper-Scissor.

Model Learning and CTW. Bayesian model averaging is a well-studied technique in 
statistics and machine learning [HMRV99, Bun92, OH95, CGM98]. There is a nice connection
between CTW, Buntine's tree-smoothing algorithm [Bun92], Winnow-style online
learning [Lit88, LW94], and boosting [FS97]. The key idea behind Lemma 1 appears
in [Bun92, Lemma 6.5.1]. The same technique is used in [HS97] to implement an efficient
version of the P(β) online learning algorithm [CBFH+93] as a way of avoiding the
problematic post-pruning step in decision-tree induction [BFOS84]. [PS99] then builds
on that work to implement an efficient version of the Hedge algorithm [FS97] for con- 
structing mixtures of the larger class of edge-based (as opposed to node-based) prunings 
of a tree. The algorithm in [PS99] can be used in conjunction with the predicate CTW 
idea to enlarge our agent's model class. 

There are several noteworthy ways the basic CTW algorithm can be extended. The fi- 
nite depth limit on the context tree can be removed [Wil94] without increasing the asymp- 
totic space overhead of the algorithm. We chose to avoid this extension however due to 
the asymptotic time complexity increase of generating a symbol from linear in the con- 
text depth to linear in the number of observed symbols. CTW has also been extended to 
general non-binary alphabets, and the state-of-the-art seems to be the DE-CTW algorithm 
[BEYY04, BEY06]. We opted not to use DE-CTW for several reasons. Firstly, DE-CTW 
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is not a strictly online algorithm: a preprocessing phase is required to compute a way 
of decomposing the alphabets. Secondly, what is computed by DE-CTW isn't really a 
Bayesian mixture and this is an unnecessary deviation from the theory of AIXI. Lastly, 
most of the effects of decomposing alphabets can in fact be realised using the predicate 
CTW extension. 

Future work. Our experimental results have been restricted to problems of modest size. 
Future work will attempt to apply the algorithms presented here to more challenging 
domains. 

The biggest limitation of our current agent is the restricted model class. Prediction 
suffix trees are simplistic models, inadequate to compactly represent something as simple 
as the rules of TicTacToe. Furthermore, the strong emphasis placed by CTW on temporally
recent symbols is appropriate for only a subset of interesting real-world problems.
The aim of the Predicate CTW extension is to relax this restriction somewhat, yet keep 
the desirable computational properties of CTW. As these predicates are arbitrary boolean 
functions on the agent's history, they have the power to represent more complicated pieces 
of information that are useful to an agent in terms of making sensible predictions. Domain 
knowledge can be encoded in the form of user-supplied predicates, which seems essential 
for our agent to have any realistic chance of scaling to problems with real-world visual 
or audio data. Given a large model class $\mathcal{P}$, the main learning problem in predicate CTW
is in the identification of a small subset $\mathcal{P}'$ of $\mathcal{P}$ that is relevant to the current environment.
This is a major unsolved problem in our setup and we think a suitable application
of the Minimum Message Length principle [Wal05] along the lines of [Hut09b] would 
shed much light on the key issues. 

Furthermore, the performance of our agent is dependent on the amount of thinking 
time allowed at each time step. A crucial property of Predictive UCT is that it is naturally 
parallel. A prototype parallel implementation of Predictive UCT has been completed, 
with promising scaling results using between 4 and 8 processing cores. We are confident 
that further improvements to our prototype implementation will allow us to solve prob- 
lems where the amount of search, rather than the agent's predictive power, is the main 
performance bottleneck. Continuing advances in computer hardware will no doubt help 
address this issue as well. 

9 Conclusion 

The main contribution of the paper is the extension and synthesis of two key results from 
online MDP planning (UCT) and information theory/machine learning (CTW) in the de- 
sign of an agent that directly and scalably approximates the AIXI ideal. This is an im- 
portant result. Although well established theoretically, it has previously been unclear 
whether AIXI could motivate the design of practical, yet theoretically well-founded algo- 
rithms. Our work answers this question strongly in the affirmative: empirically, our AIXI 
approximation achieves state-of-the-art performance and theoretically, we can provide 
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some characterisation of the type of environments we expect our agent to handle. 
To develop this approximation, we introduced two key algorithms: 

• Predictive UCT - a histories-as-states expectimax approximation algorithm;

• action-conditional CTW - an agent-specific generalisation of the CTW algorithm.
Furthermore, we demonstrated that our approach opens a number of future research areas: 

• incorporating background knowledge through the predicate CTW extension; 

• the possibility of constructing self-improving heuristic playout policies. 

Although we are a long way away from being able to construct a truly powerful gen- 
eral agent, the future looks promising. We hope this work generates further interest from 
the broader artificial intelligence community in both AIXI and general reinforcement 
learning agents. 
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